Educational Measurement and Evaluation


Education for Living Series


UNDER THE EDITORSHIP OF
H. H. REMMERS




H. H. REMMERS 
Professor of Education and Psychology and 
Director, Division of Educational Reference 

PURDUE UNIVERSITY

N. L. GAGE
Associate Professor of Education
Bureau of Research and Service
UNIVERSITY OF ILLINOIS


HARPER & BROTHERS
Publishers
NEW YORK LONDON


EDUCATIONAL MEASUREMENT AND EVALUATION

Copyright, 1943, by H. H. Remmers and N. L. Gage.
Printed in the United States of America.
All rights in this book are reserved.
No part of the book may be reproduced in any
manner whatsoever without written permission
except in the case of brief quotations embodied
in critical articles and reviews. For information
address Harper & Brothers.


Contents 


Foreword 


I. Why Evaluate


Part One: What Should Be Evaluated?

Introduction
II. Achievement of Instructional Objectives
III. Physical Aspects of Pupils
IV. Mental Abilities
V. Emotional and Social Adjustment
VI. Attitudes
VII. Environment and Background


Part Two: How to Evaluate

Introduction
VIII. Achievement Testing
IX. Constructing Short-Answer Tests
X. Choosing Standardized Tests
XI. Product and Procedure Evaluation
XII. Essay Testing
XIII. Evaluating Physical Aspects of Pupils
XIV. General Mental Abilities
XV. Special Abilities
XVI. Adjustment
XVII. Attitudes and Related Aspects
XVIII. Environment and Background
XIX. The Teacher
XX. Administering the Evaluation Program
XXI. Interpreting Scores
XXII. Interpreting Scores (Continued)
Appendix
Index


Foreword 


THE REASONS FOR WRITING BOOKS ARE MANY AND VARIED—AS VARIED 
as the individual writers. Apart from ego-inflationary reasons which
are generally not socially acceptable, the following constitute a 
framework of principles which have motivated and guided the 
writing of this book. 

1. The modern psychological concept of the whole individual—
not just the academic to-be-taught-subject-matter individual—in 
his various socially significant dimensions needs to be guided and 
stimulated to grow in terms of these dimensions. 

2. The constant interaction between the individual and his en- 
vironment in all relevant dimensions, the needs of the individual 
and the needs of the society in which he lives—these constitute the 
frame of reference for the scope of this book. Hence in addition to 
the traditional topics of “achievement” and “intelligence” testing 
there are included such chapters as those dealing with the physical 
aspects of pupils, their emotional and social adjustment, their 
attitudes and their environment and background, together with 
a chapter on the evaluation of the teacher as a highly important 
part of that environment. 

3. This entails psychological understanding and insight on the 
part of teachers and counselors. Hence the first part of the book 
is concerned not a little with psychological theory. Education will 
have juice, bounce, and vitality to the extent that teachers are 
themselves well educated and oriented to the world in which they 
live, including especially the youngsters whom they teach as part 
of that world. 

4. Theory without practice is futile and practice without theory
becomes dangerous immediate time-serving. In one sense, to be
sure, we are all time-servers. It is only in the time perspective that 
we differ—in the units of time that we serve. Human finiteness 
forbids the presumption of attempting to build for eternity, but 
human needs require greater than day-to-day foresight. Hence the 
second part of the book presents practice in harmony with the 
theory of the first part, designed to implement not only the day-to- 
day purposes of educational endeavor but also the long-time view 
with which education worthy of the name must concern itself. 

5. There is at present little uniformity in the prerequisite re- 
quirement of elementary statistical concepts for courses in evalua- 
tion and measurement. Reasonable mastery of such statistical con- 
cepts is, however, a necessity for any safe use of measurement and 
evaluation devices and techniques. This dilemma has been re-
solved by placing at the end of the book two chapters covering
the minimum essentials of statistics which teachers in training are 
likely to master. The instructor can if he chooses teach this chapter 
first, last, or in parallel with the other chapters requiring statistical 
insight. 

6. The book makes no pretense of being a handbook for guidance. 
It is rather concerned with providing the methods and techniques 
for obtaining the facts and evaluative data necessary for valid 
guidance. 

The authors’ intellectual debts are many and varied. Grateful ac-
knowledgment is due Doctor Frank Peyton, who read and criti-
cized the manuscript of Chapter III particularly with a view to its
orientation to the medical profession. Doctor Arthur E. Traxler, 
Associate Director of the Educational Records Bureau, made valu- 
able constructive suggestions on Chapters I to XV. Professor Harry 
C. Steinmetz, Head of the Department of Psychology of San Diego 
State College, gave generously of his experience and insight for the 
major part of the manuscript. Students too numerous to mention 
who as teachers in service took the course for which this book was 
prepared made many suggestions that found their way into the 
manuscript. Any errors of either omission or commission are not, 
however, chargeable to these friends and colleagues. For these the 
authors must be held responsible. In the typing of the manuscript 
Miss N. Elizabeth Brant gave highly competent service. 




ACKNOWLEDGMENTS 


The authors’ thanks are extended to the following for permission
to quote from their publications:

American Association of School Administrators
American Child Health Association
American Council on Education
D. Appleton-Century Company
Professor A. S. Barr, University of Wisconsin
Bulletin of Purdue University
Bureau of Publications, Teachers College, Columbia University
Children's Bureau, U. S. Department of Labor
College of Education, Ohio State University
Co-operative Test Service
F. S. Crofts and Company
The Thomas Y. Crowell Company
Duke University Press
Ginn and Company
Graduate School of Education, Harvard University
Harcourt, Brace and Company
Harper & Brothers
Henry Holt and Company
Houghton Mifflin Company
McGraw-Hill Book Company
The Macmillan Company
National Council of Teachers of English
National Federation of Modern Language Teachers
Nervous and Mental Disease Publishing Company
Ohio College Association
Personnel Research Foundation
Public School Publishing Company
Purdue Research Foundation
Regents Inquiry into the Character and Cost of Public Education in the State of N. Y.
Russell Sage Foundation
Charles Scribner’s Sons
Eugene R. Smith, Beaver Country Day School
U. S. Employment Service
U. S. Office of Education
University of London Press
University of Minnesota, Committee on Educational Research
Warwick and York, Inc.
Western State Teachers College, Kalamazoo, Michigan
World Book Company


PART ONE 


What Should Be Evaluated? 


CHAPTER I 


Why Evaluate 


“TO ENABLE THE RIGHT PUPILS TO RECEIVE THE RIGHT EDUCATION FROM 
the right teachers" may be considered the aim of the good educa- 
tional system. Although at first glance this definition seems merely 
a truism, it does provide the basis for an approach to the question 
of why pupils should be evaluated. For the present discussion, 
directed toward classroom teachers, let us assume that the right 
teacher has been discovered and is operating, and even becoming 
“righter” as he reads this book. But the right teacher must be con-
cerned whether he has the right pupils and is giving them the right
education. And to satisfy these concerns he must evaluate his
students and discover whether the learning process in which he is
leading them is bearing desirable fruit for them individually and
for the society they should serve. The right pupil is the pupil whose 
personal attributes and opportunities enable him to profit fully from 
the education offered him. The right education is the education best 
suited to the needs of both society and the pupil. Obviously, the 
task of determining this “rightness” of pupils and educations is a 
task for the process of evaluation. Pupils’ attributes and opportu- 
nities must be ascertained. Pupil behavior must be evaluated at all 
stages of the interaction between pupil and education to determine 
the fitness of one for the other. 

Contemporary education is divided, at all age levels, into numer- 
ous subject-matter fields, such as mathematics, social studies, and 
the like. Whether the teaching process itself is thus segregated, or
whether it encompasses all subjects in one activity program, the
conceptualization of the modern educational curriculum into “sub-
jects” is inevitable, for man’s interests, activities, and problems are
themselves thus divisible. And in each of these subject-matter areas
it is possible for many varied mental processes, such as rote memory 
or logical inference, to take place. Each of these mental processes
has some value, great or small, in each of the subject-matter areas.
The differences between subject-matter areas in the degrees to which 
they involve various mental processes, in the societal needs they 
satisfy, in the vocations to which they are most relevant, in the 
personal values or types to which they appeal, all operate to render 
the modern school curriculum a heterogeneous, variegated thing. 
In the same way, the economic and social activities in which 
pupils may now or in the future engage differ among themselves 
very widely. Some jobs require great muscular strength and little 
mental ability, while others require the opposite combination. Some
social activities and roles require a knack for leadership or domi- 
nance, while others can be undertaken successfully only by adaptable 
followers. For all the thousands of different occupations, recrea- 
tions, and civic activities engaged in by our millions of citizens 
there are needed different combinations of strengths and weak- 
nesses in all the various “mental” and “physical” aspects of persons. 
Similarly, pupils differ, both within themselves and among them- 
selves, in all aspects of their personalities and especially in all those 
dimensions which determine their fitness for success in the various 
subject-matter areas and social roles afforded by contemporary so-
ciety. Pupils differ, like all living things, in the behavior and re-
sponses of which they are capable, in intelligence, interests, and op-
portunities. These individual differences result in differences in the
types of activities, educational and vocational, from which pupils may 
best profit and in doing which they may best serve society. To illus- 
trate, let us examine an extreme case of physical differences. Society 
does not fit the weak or the feminine into activities requiring great 
physical stamina. Invalids do not work as lumberjacks. In the same 
way, society must prevent the intellectually weak from filling posi-
tions requiring intellectual strength. And the centers which society
provides for cultivating the mental strengths it needs to keep itself 
going—that is, the schools—are the places where this process of 
fitting pupils to curricula, and citizens to social roles, is carried 
on with the best results for individual adjustment and social wel-
fare.

In the schools this fitting process is guidance, i.e., the guidance
of pupils differing in all ways within and among themselves into 
activities, curricula, and vocations which differ within and among 
themselves in the capacities they require, so as to make the best fit 
between pupil and education. 


REQUISITES OF GUIDANCE


Guidance requires the evaluation of pupils so that their specific 
capacities, both strengths and weaknesses, may be determined. As
pupils proceed up the educational ladder in elementary school, 
with its general curriculum providing the common core of skills 
(reading, writing, arithmetic, etc.) that everyone in our civilization 
must acquire, the process of evaluation must operate upon them so 
as to reveal differences in their aptitudes, abilities, achievements, 
interests, environmental backgrounds, and all other relevant at- 
tributes. It is during this elementary schooling that pupils set off 
on diverging paths, some toward one set of interests and goals and 
some toward another. By the time they reach secondary schools the 
pupils on the various paths are already far apart from one another 
in their educational needs and capacities. At this point the knowl-
edge of the pupil which the teacher has gained during the ele-
mentary schooling should crucially affect the decision of the pupil
(and his parents) as to the type of secondary education to be selected.
This knowledge must be the result of continuous evaluation through-
out the six or eight years of the elementary educative process. And
the evaluation must have been comprehensive throughout those 
six or eight years, involving as many important aspects of the 
pupil's personality as it is possible for a teacher to perceive. Similarly, 
guidance and evaluation should operate during and at the end of 
the secondary schooling. 

The continuity of the evaluation process implies that it should go 
on during all the time that the teacher can observe the pupil, not 
only on the special occasions when tests are given or report-card 
grades are determined. Every recitation, every assignment, every 
conversation, every behavioral detail that the teacher can observe 
should be material for the evaluation process and the basis for a 
record whereby he may accumulate knowledge of the pupil and 
pass this accumulated record of evaluations and evidence on to the
pupil's next teacher. The specific content and techniques of con-
tinuous evaluation in the actual school situation are the substance
of later chapters of this book. At present our thesis is that guidance 
requires of the teacher a continuously evaluating point of view. 

The comprehensiveness of the evaluation process refers to its ex- 
tent over the whole personality of the pupil, rather than merely his 
intellectual achievement. Evaluation of a pupil's knowledge is, of
course, extremely important in guiding him. But also important 
for his happiness and for fitting him into the world of work is 
evidence concerning his aptitudes and interests, his temperament, 
his attitudes, his social adaptability, his habits of work and play, 
his physical characteristics. Each of these aspects of the pupil’s per- 
sonality, and many more, must be evaluated both with respect to 
its position in relation to other aspects or dimensions of the pupil 
and with respect to its position in relation to similar aspects of the 
population at large. 
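
The two comparisons named above, an aspect's standing against the population at large and its standing against the pupil's other aspects, can be illustrated with standard scores. What follows is only a minimal sketch in Python. The three measures, the population means and standard deviations in `NORMS`, and the pupil's raw scores are all hypothetical values chosen for illustration, and the helper names `standard_scores` and `profile` are not drawn from this book.

```python
# Hypothetical norms: (population mean, population standard deviation) per measure.
NORMS = {
    "reading": (50.0, 10.0),
    "arithmetic": (40.0, 8.0),
    "mechanical aptitude": (30.0, 5.0),
}


def standard_scores(raw_scores, norms):
    """Convert raw scores to standard (z) scores.

    Each z-score is the distance of the raw score from the population
    mean, expressed in population standard deviation units.
    """
    return {
        name: (raw - norms[name][0]) / norms[name][1]
        for name, raw in raw_scores.items()
    }


def profile(raw_scores, norms):
    """Return (z-scores, z-scores relative to the pupil's own average z-score)."""
    z = standard_scores(raw_scores, norms)
    own_mean = sum(z.values()) / len(z)
    relative = {name: value - own_mean for name, value in z.items()}
    return z, relative


# Hypothetical raw scores for one pupil.
pupil_raw = {"reading": 65.0, "arithmetic": 36.0, "mechanical aptitude": 35.0}
z, relative = profile(pupil_raw, NORMS)
for name in z:
    print(f"{name}: vs. population {z[name]:+.2f}, vs. own average {relative[name]:+.2f}")
```

The first figure on each printed line places the pupil among pupils in general; the second shows which of his own aspects stand above or below his personal average. Both views depend entirely on the adequacy of the norms assumed.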

With such broadly continuous and comprehensive evaluative
evidence concerning its pupils, a school can properly undertake its 
important function of guidance. Such guidance is not to be con- 
sidered as operating only at the students' entrance into or exit from 
a particular level of schooling or vocation, although this is its most 
striking occurrence. Rather guidance should operate within the 
schooling period. For example, evaluations should assist in the
solution of such guidance problems as choice of elective subjects. 
Should a pupil take a twelfth-year English course or instead a 
course in typewriting? Which laboratory science should a pupil 
elect, physics, biology, or chemistry? In which section of a course 
in tenth-year English, mathematics, or biology should a pupil be 
placed? Why is a pupil failing in history, in physics, or in Latin? 
Why is a certain pupil becoming a disciplinary problem? Answers 
to such questions, of which those here presented are only a small 
illustrative sample, all come under the broad heading of guidance. 
And the answers should not be given without a basis of valid evalua- 
tive data on the pupil. 

The philosophical basis of the need for educational measurement 
and evaluation may be found to a large extent in the writings of 
such men as Thorndike (6: 67-70)* on the individualization of
education, Cowley (1) on student personnel work, and, by contrast,
in the completely intellectualistic emphases of Hutchins (4) and
Flexner (2). A valuable summary of the conflicting points of view
concerning whether schools should deal with the whole pupil or
with only his intellect is given by Williamson (8: 1-35) and should
be consulted by any reader interested in further justification of the
premises on which this volume rests.

* Numbers in parentheses refer to the Bibliography at the end of the chapter.

Tyler (7) especially has emphasized the need for evaluation, on a 
comprehensive, thoroughgoing scale, of the degree to which pro- 
gressive education achieves the objectives set up for itself. Through 
a cooperative attack by teachers, school administrators, parents, 
librarians, and other persons concerned with pupils, Tyler planned 
to secure the comprehensive appraisal of the changes in pupils in 
all aspects, both “tangible” and “intangible,” with which progressive 
education is concerned. 


OTHER PURPOSES OF EVALUATION


In the course of satisfying the need for data on which to base the 
guidance of pupils, the teacher will acquire evidence enabling the 
school to serve many other purposes. These purposes, for the main 
part more traditional than that of guidance, include the many 
“uses of tests and measurements” put forth since educational meas- 
urement began. These uses are usually ascribed to educational 
evaluation conceived more narrowly as the ranking of pupils with 
respect to individual differences in their achievement in a particu-
lar course of study. Given the instruments, i.e., the achievement
tests, with which to make such rankings, the teacher or administrator
may then use them for the following purposes: 

1. To maintain standards
2. To select students
3. To motivate learning
4. To guide teaching
5. To furnish instruction
6. To appraise teachers, teaching methods, books, curricular
content, etc.

Each of these uses of tests, or reasons for evaluating pupils, can
take several forms of varying validity and desirability. Conse-
quently, let us now subject them to a critical discussion to clarify
them and furnish an understanding of their benefits and dangers, 
and of how they can best be realized. 

1. Standards are necessary to society in carrying on its social and 
economic life. In the production of economic goods and the per- 
formance of vital social services, standards define the minimum 
degrees of excellence which society can accept. Airplane mechanics, 
doctors of medicine, pharmacists, practitioners of law, teachers, 
dentists, engineers, plumbers, carpenters, and professionals in all 
the learned fields all must meet certain standards, as set forth, for 
example, in professional licensing examinations, before society will 
permit them to practice in their fields. Standards also exist at all the 
earlier stages of the educational ladder, in promotion from one class 
to another, in admission to secondary and higher education, in the 
granting of honors, in graduation, in transfer from one institu- 
tion to another. At all these junctures, standards furnish the means 
of social control, and evaluations become the means whereby so- 
ciety determines whether the standards are satisfied and the candi- 
date may “pass.” Such social control, and the implied maintenance 
of standards and evaluations necessary to enforce minimum degrees 
of attainment, where society can get along with nothing less, are a 
legitimate and valid use of educational evaluation. 

But when standards and evaluations are employed to enforce 
uniform curricula or courses of study, in violation of the fact of 
individual differences and of the truth that men can serve them- 
selves and society in many different ways, they become a hindrance 
to individual adjustment and the proper use of human resources. 
Illustrations of such misuse of standards and evaluations are 
numerous in the history of American education. Schools and col- 
leges with large proportions of failures and dismissals may be con- 
sidered to be perverting rather than maintaining their standards, 
through a failure to provide for individual differences. Schools 
which provide only one curriculum for students of varying abilities 
and interests are similarly abusing their standards. Thus the charge 
can still be justifiably leveled at many secondary schools that they 
are maintaining high but false academic standards by providing 
academic, or pre-college, training for students who will never, and
perhaps should never, go to college. Where standards thus become
a matter of prescribing the bookish attainments suited to the se- 
lected pupils of forty years ago for the relatively unselected millions 
who go to school today, these standards and the evaluations and 
examinations which implement them will cause untold damage 
and misery to vast numbers of pupils at all levels of education and 
waste precious resources of human society. Such attempts to force
the widely varying kinds and degrees of human capacities and in- 
terests into one uniform academic mold have thwarted and warped 
many pupils into social misfits. 

This misuse and potential danger of educational evaluation, when 
it is used to maintain standards in violation of individual differences 
and other than as indispensable social controls, is perhaps more a 
misunderstanding of curriculum construction than of the functions 
of evaluation. The argument against the use of evaluations to up- 
hold invalid standards should, however, prevent their misuse by 
teachers in this respect. The teacher will thus recognize the need 
for more and more standards, highly differentiated and adaptable to 
as many kinds and degrees of interests and abilities as the evalua- 
tion of students may reveal. And the major function of teacher 
evaluations will no longer be standards-maintenance but rather 
standards-revelation. Each pupil will be evaluated to discover the 
educational directions and distances he is best fitted to travel. This 
point of view in turn shifts much of the “standards” function of 
evaluation back into the realm of guidance. 

2. The selection of pupils as a function of educational evaluation 
is a form of guidance in reverse. That is, selection is usually made 
with the advantages of the institution, say college, as the primary 
consideration, with the advantages to students in being rejected as 
probable misfits only an incidental by-product of the selection. 
Such educational selection is becoming increasingly rare in Ameri- 
can educational institutions, being at present confined to a small 
minority of the institutions of higher learning and to the relatively 
small number of private schools at the primary and secondary levels.
At the college level perhaps the foremost agency of educational 
measurement for selection purposes is the College Entrance Examina-
tion Board, which annually measures the scholastic achievement
and aptitudes of thousands of applicants for admission to its
forty-odd member colleges. Within the undergraduate levels of the 
University of Chicago, the Board of Examiners provides educational 
measurements of achievement in the Junior College which are the 
sole determiners of admission to the junior and senior years of col- 
lege work. 

The use of educational measurements for selection purposes is 
usually based on the assumption that they enable prediction of 
future college success. It is in this respect that the selection function 
may be distinguished from the maintenance-of-standards function, 
in that the latter is concerned with the appraisal of achievements 
for their intrinsic worth to society or to the individual rather than 
as predictors of future achievement. The substantial correlation 
coefficients obtaining between measures of pupils in high school 
obtained with various evaluating devices and measures of those 
pupils’ subsequent college success furnish evidence for the validity 
of educational evaluation in the selection of students. 
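
The correlation coefficient appealed to here can be made concrete with a short computation. The following Python sketch computes the Pearson product-moment coefficient between a high school measure and a measure of later college success. The eight pairs of scores are invented solely for illustration, and the function name `pearson_r` is an assumption of this sketch; it does not reproduce any data or procedure reported in this book.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson product-moment correlation between two equal-length lists."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least two values")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations from the two means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # Sums of squared deviations for each variable.
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / sqrt(var_x * var_y)


# Hypothetical data: high school test scores and later college grade averages
# for the same eight pupils, listed in the same order.
high_school_scores = [52, 61, 67, 70, 75, 83, 88, 94]
college_averages = [1.8, 2.1, 2.0, 2.6, 2.9, 2.7, 3.4, 3.6]

r = pearson_r(high_school_scores, college_averages)
print(round(r, 2))
```

A coefficient near +1 would indicate that pupils ranking high on the earlier measure tend to rank high in college as well; a coefficient near zero would mean the earlier measure offers little basis for selection.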

Nowadays education through the secondary level is given to ap- 
proximately 75 per cent of the children of high school age, and the 
state universities are required by law to admit to college work 
anyone possessing a high school diploma. Clearly the use of evalua- 
tions for educational selection must increasingly give way to the 
guidance function. Democratic social forces have deprived educa- 
tors of the right to say “Yes” or “No” to the individual’s plea for 
educational opportunity. Rather the educator must now increasingly 
answer, “Yes—What kind can you use?” And he must ask this 
latter question through the use of educational evaluations in as 
scientific and objective a manner as possible. His answer will be in 
terms of guidance among the various curricula of secondary school 
and college, with only the higher levels of college and professional 
work definitely closed to those whom educators consider unfit. At 
the University of Minnesota, for example, the General College repre- 
sents the notion of educational opportunity for all even at the 
college level. Into it are admitted all those who do not qualify, on 
the basis of various evaluative devices, for admission to the College 
of Science, Literature, and the Arts, and at the same time are in-
terested in and capable of using two years of general education
beyond the high school. The General College thus provides an
avenue into which may be guided all those whom selection at the
college level has heretofore turned away from further education. 

3. The motivation of learning by examinations also possesses sev- 
eral aspects. Such motivation may be a matter of crystallizing into 
one test score the full weight of the social and self-criticism that the 
learning process needs. This use of evaluations is an application of 
Thorndike’s Law of Effect, in that the student comes to realize that 
certain types of behavior, say studiousness, will be rewarded in 
certain ways, say by high test scores; and conversely for other 
types of behavior. Secondly, in furnishing such motivation, the form 
of the test can have crucial effects on the learning process in all its 
aspects. What the pupil will study and seek to learn is determined 
largely by what he expects the measure of his learning to be. If past 
experience leads him to expect that his learning will be evaluated on 
the basis of how much rote memorization he has done, then rote 
memory will be his method of study. If, on the other hand, he ex- 
pects the test to be one of ability to apply principles, or of ability to 
interpret data, or of broad “acculturation” in the lore of a subject, 
or of acquisition of practical technical skills, then he will direct his 
study habits and learning processes toward these ends. Thus evalua- 
tion motivates learning not only toward the expenditure of quanti- 
ties of effort but also in the kinds of learning, of study habits, of 
educational outcomes toward which the effort is directed. 

From this distinction between the amount and the kind of 
motivation provided by evaluation, we can see both the dangers 
and the benefits latent in its use for the motivation of learning: 
evaluation of insignificant, temporary outcomes of education leads 
to undesirable study habits, while evaluation of outcomes which are 
meaningful, long lasting, and congenial to a pupil's interests and 
capacities leads to desirable study habits. And since what is meas-
ured affects what will be learned, our determinations of what is to
be measured are a vital part of the educational process.

4. Evaluations can guide teaching when they furnish diagnoses of 
specific strengths and weaknesses in the pupil’s achievements or 
capacities. The teacher may then seek either to eliminate the weak- 
nesses by using special teaching methods and emphases, or to circum- 
vent them by directing learning toward areas where the pupil's 
efforts will be more fruitful. The causes of weakness in a specific
subject can be traced by tests to any one of the various possible
pupil inadequacies underlying it. Thus weakness in a history or 
geography course may be found to be due to lack of either com- 
prehension power or speed in reading, as revealed by a test of read- 
ing ability. Or it may be due to lack of general intelligence, or to 
lack of the presupposed background material, a lack which could 
be revealed by tests designed to measure achievement at a lower 
level of the subject. Similarly, a pupil’s difficulties in arithmetic may 
be traced by means of tests to a specific inadequacy, such as in- 
ability to deal with certain combinations of digits, or lack of a 
correct technique for carrying, or the use of a cumbersome method 
of short division. Diagnostic testing may thus reveal the precise 
sources of a pupil’s shortcomings and guide the teacher to the 
optimum way of overcoming them. Inadequately equipped pupils
may thus be studied and given special attention aimed at the 
basic roots of the “failure.” They may be shown in what parts of a 
subject or what outcomes of the course they are weak. Teachers 
may discover what parts of a topic or unit need to be retaught, or 
taught differently. Those pupils who are capable of doing excep-
tional work may be discovered by evaluation procedures and the
teacher’s efforts may be guided toward special assignments and 
references for them for the fuller realization of their potentialities. 

Certain modifications in the construction procedure for an achieve- 
ment test are necessary to build into the test, by means of the 
classification of item content, the means whereby it can be used 
diagnostically. That is, if the test builder is careful to analyze his 
objectives and to apportion a sufficient number of specific items to 
each objective, the total test score may later be broken down into 
part scores which furnish measures of the specific outcomes. Gross 
disparities in a pupil’s achievement of the various objectives can be 
thus revealed and laid open to remedial treatment. 
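
The breakdown of a total score into part scores can be sketched as follows. This Python example assumes a hypothetical nine-item test in which each item has been assigned in advance to one of three objectives. The objective names, the item assignments, the pupil's responses, and the one-half cutoff used to flag a weak objective are all illustrative assumptions, not recommendations taken from the text.

```python
from collections import defaultdict

# Hypothetical test blueprint: each item number is assigned to one objective.
ITEM_OBJECTIVES = {
    1: "recall of facts", 2: "recall of facts", 3: "recall of facts",
    4: "application of principles", 5: "application of principles",
    6: "application of principles",
    7: "interpretation of data", 8: "interpretation of data",
    9: "interpretation of data",
}


def part_scores(responses, item_objectives):
    """Break a pupil's item results (item -> True/False) into part scores.

    Returns {objective: (number_correct, number_of_items)}.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for item, objective in item_objectives.items():
        total[objective] += 1
        if responses.get(item, False):  # unanswered items count as wrong
            correct[objective] += 1
    return {obj: (correct[obj], total[obj]) for obj in total}


def weak_objectives(scores, threshold=0.5):
    """List objectives whose proportion correct falls below the (assumed) threshold."""
    return [obj for obj, (right, n) in scores.items() if right / n < threshold]


# Hypothetical responses of one pupil: True means the item was answered correctly.
pupil = {1: True, 2: True, 3: True, 4: True, 5: False, 6: True,
         7: False, 8: False, 9: True}
scores = part_scores(pupil, ITEM_OBJECTIVES)
total_score = sum(right for right, _ in scores.values())
print(total_score, scores, weak_objectives(scores))
```

With only three items per objective, each part score is a very coarse measure, which is why the text asks that a sufficient number of items be apportioned to each objective.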

5. In functioning as an instructional device, educational measure- 
ments should result not only in increased self-knowledge on the 
part of the pupil, but also in increased attainment of the specific 
objectives toward which a given educational process is intended to 
lead. For example, a test in history can furnish instruction in the
present sense, not merely by giving the student an appraisal of his
knowledge of history but also by increasing his knowledge of his-
tory. This function of measurement can be served best when it is
divorced in large part from the process of evaluation. That is, ideally 
tests and measurements should not “count,” i.e., affect grades, when
they are used for instructional purposes. Rather the pupils should 
decide whether the tests are to be given; the taking of the test should 
be completely voluntary. The tests should be scored and returned 
to the pupil but the scores should not be recorded by the teacher. 
The pupil is thus provided by the total score with a means of self-appraisal
and self-guidance and by his own analysis of his responses
with a detailed picture of just how he has succeeded or failed in
attaining the various objectives of the teaching unit or course of
study.

The purpose of not recording the scores made on educational 
measurements used for instructional purposes is to shift the pupil's 
attitude toward his failures, errors, and inadequacies from one of 
“the harm is already done and there’s no use now in learning the 
answers to these flunked questions.” Instead the pupil takes an 
attitude which regrets the inadequacy as a real gap in his own 
personal equipment and as something to be remedied for future 
use. Learning, it has been shown, is facilitated by knowledge of 
its progress. The use of tests for instructional purposes takes ad- 
vantage of this principle of learning by giving the pupil, through a 
relatively painless appraisal of his present status, a springboard 
from which to strive toward greater achievement. 

6. The use of evaluations in the appraisal of educational instrumentalities
such as teachers, teaching methods, and textbooks is
easily understood but not easily practiced. The principle under- 
lying all such uses of evaluations is that whatever yields the greatest 
realization of the educational objectives desired is the best teacher, 
teaching method, textbook or other instrumentality of the educa- 
tive process. And since most educational objectives may be stated 
in terms of desired changes in pupils, educational evaluations of 
those pupils will provide a measure of the degree to which the 
objectives are attained and of the effectiveness of the instrumentality
concerned. The difficulty in using educational evaluations for these
purposes is that no worth-while, reliable results can be obtained
unless the appraisal becomes a controlled scientific experiment, with 
significant variables held constant and with statistical tests of sig- 
nificance applied to the results. The complexity and refinement of 
the procedures and precautions that are necessary may be appreciated 
from a perusal of recent textbooks (3, 5) in advanced statistics and 
in educational research. 
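
As a rough indication of what a statistical test of significance involves, the sketch below computes Welch's t statistic for the difference between the mean scores of two groups taught by different methods. The text names no particular test, so the choice of Welch's t is an assumption, and the scores are invented for illustration.

```python
import math
import statistics

def welch_t(group_a, group_b):
    """Welch's t statistic for the difference between two group means."""
    mean_a, mean_b = statistics.mean(group_a), statistics.mean(group_b)
    # statistics.variance gives the sample variance (n - 1 in the denominator).
    var_a, var_b = statistics.variance(group_a), statistics.variance(group_b)
    standard_error = math.sqrt(var_a / len(group_a) + var_b / len(group_b))
    return (mean_a - mean_b) / standard_error

# Invented final-test scores under two hypothetical teaching methods.
method_a = [72, 75, 78, 80, 85]
method_b = [65, 70, 72, 74, 79]

mean_difference = statistics.mean(method_a) - statistics.mean(method_b)
t_statistic = welch_t(method_a, method_b)
```

The statistic would then be referred to a table of t for the appropriate degrees of freedom. A computation of this kind says nothing about whether the groups were comparable to begin with; that depends on the experimental design.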

Factors pointed to as having to be held constant in order to 
justify ascribing an effect to any "treatment," such as teaching 
method, are, for example, intelligence of pupils, previous educa- 
tional background of pupils, study habits of pupils, home environ- 
ment of pupils, teacher skill, school conditions, and the like. The 
subtleties of experimental design that are involved in the valid use 
of educational evaluations for this appraisal research should pre- 
vent such evaluations from being used in this way by any but 
qualified research workers. Principals, superintendents, and others
who make appraisals must therefore either acquire a mastery of 
these scientific methods for themselves, or turn the problems over 
to experts, or cease to use evaluations for the appraisal of educa- 
tional instrumentalities. 


SUMMARY


Determining the rightness of pupils and educations for one an- 
other requires evaluation. This is because of the heterogeneity of 
pupils in all aspects affecting their adjustment to themselves and 
society and the heterogeneity of educational and economic activities 
available for students in a democracy. The necessary evaluation of 
pupils must be both continuous and comprehensive if it is to serve 
optimally in its major role of enabling the guidance of pupils. The 
standards-maintenance function of evaluation, when properly conceived,
becomes more of a standards-revelation function. The selection
function is similarly yielding to the guidance function. The
motivation function must pay heed to the quality as well as the 
quantity of the learning effort. The diagnostic function of revealing 
pupil strengths and weaknesses requires special techniques other 
than general single-score tests. The instructional function can be 
realized only when evaluations are allowed to be self-imposed self- 
criticisms. The appraisal function can be realized only in the form 
of controlled, statistically analyzed experimentation. 


QUESTIONS 


1. Enumerate arguments in favor of having schools concentrate on training
the intellect and then in favor of having them concerned with all
aspects of pupils.

2. Discuss the extent and limits of any valid distinctions between “educational,”
“vocational,” and “personal” guidance.

3. Is there a conflict between a teacher’s instructional duties and such
continuous and comprehensive evaluation as this book advocates?
Explain in terms of your conception of the major ends and means of
schools.

4. Examine your own past experiences for occasions when you needed
guidance. Was it given you by your teachers? By some other person?
On what basis was the problem finally solved? How could evaluative
data have furthered the solution?

5. List five guidance problems that might occur among pupils in your
classrooms and the possible roles which evaluative data could serve in
the solution of each.

6. Indicate for three different cases the ways in which standards might
be adjusted to differences among pupils in those aspects affecting
their chances of achieving the standards. Let the three cases be in the
fields of physical, intellectual, and social behavior, respectively.

7. Give concrete illustrations of desirable and undesirable motivational
effects of evaluation procedures in terms of differences in the comprehensiveness
and continuity of the procedures.

8. What is the difference between the functions of evaluation in guiding
instruction and in furnishing instruction?

9. Why should a teacher's merit not be judged solely on the basis of his
pupils’ standing on standardized achievement tests? Nor on the basis
solely of their improvement on such a test?


REFERENCES 


1. Cowley, W. H., “The nature of student personnel work,” Educational
Record, 17:222-223 (1936).

2. Flexner, A., Universities, American, English, German, New York:
Oxford University Press, 1930.

3. Good, C. V., Barr, A. S., and Scates, D. E., The Methodology of Educational
Research, New York: D. Appleton-Century Company, Inc.,
1936.

4. Hutchins, R. M., The Higher Learning in America, New Haven:
Yale University Press, 1936.

5. Lindquist, E. F., Statistical Analysis in Educational Research, Boston:
Houghton Mifflin Company, 1940.

6. Thorndike, E. L., Education, New York: The Macmillan Company,
1914.

7. Tyler, R. W., “Evaluation: a challenge to progressive education,”
Progressive Education, 12:552-556 (1935).

8. Williamson, E. G., How to Counsel Students, New York: McGraw-Hill
Book Company, Inc., 1939.


PART ONE 


What Should Be Evaluated? 


INTRODUCTION 


THE PRECEDING CHAPTER HAS ATTEMPTED TO ANSWER THE QUESTION, 
Why should pupils be evaluated? Having furnished some justification
for concern with the evaluation of students, we must now approach
the question, What aspects of pupils should be evaluated?
Our answer to this question must, of course, be determined in its 
major outlines by the answers we have given to the first question; 
that is, what we should evaluate depends on why we are evaluating. 
Since the major purpose of evaluation is to furnish data for guid- 
ance, in serving which we also serve the other functions of evalua- 
tion, the content of our evaluations should be whatever is needed 
as a basis for guidance. Since guidance seeks “the fullest realization 
by each pupil of his potentialities for desirable participation in the 
social order,” the data it needs are in turn determined by the needs
of society on the one hand, and of the individual on the other. For 
example, society requires that its members be able to communicate 
with one another through spoken and written language; similarly, 
individuals adjusting in society must possess such language abilities. 
Innumerable other abilities, skills, habits, attitudes, informations, 
and so on are needed by both society and the individual. It is these 
mutual needs of both society and individuals that determine the 
aspects of pupils, or kinds of data, with which the evaluation of 
pupils must be concerned. 

Let us now turn to a listing of these aspects of pupils important to 
both society and the individual, and consequently of the data essen- 
tial to guidance: 

1. Achievement of instructional objectives
2. Physical aspects
3. Mental abilities, general and special
4. Emotional and social adjustment
5. Attitudes 
6. Environment and background 

Discussions of the content of each of these aspects will be given 
in subsequent chapters. These discussions will seek to be primarily 
suggestive and illustrative while at the same time providing practical
hints to teachers for their own thinking about each of these aspects
of pupils.




CHAPTER II 


Achievement 
of Instructional Objectives 


IN THE FIRST CATEGORY, ACHIEVEMENT OF INSTRUCTIONAL OBJECTIVES, 
will be put data concerning the degree to which the pupil has 
moved toward the objectives of the school. These objectives are the 
goals in the direction of which educational processes seek to change
pupils. The “objective” is, of course, a normative concept, carrying
value implications of the good, the desirable, and reflecting the
purposefulness of the educational process.

What are the objectives of schools, of specific courses, curricula, 
teachers? In what ways are these social instrumentalities trying to 
change pupils? This question, although it constitutes mainly a 
curriculum problem, is also basic in evaluating educational achievement.
For not only the content and methods of instruction but also
the evaluating instruments must be determined by the objectives
set up. The importance of objectives in designing evaluating instruments
has been effectively stressed in the writings of C. C.
Peters (14:148-159) and R. W. Tyler (18:5-6). This point of view,
that the first and last steps in educational evaluation should be,
respectively, the formulation of objectives and the validation of the
evaluating instrument against the objectives, is so appealing to
common sense as to seem platitudinous. However, it is only since this
point of view has been adopted by evaluation workers that evaluation
has begun to escape from the stultifying emphasis on one type
of objective, memorization of information, that has characterized it
until recently. As a result of looking to the objectives of instruction
for the determiners of evaluation methods, there has been exposed
the failure of the vast majority of these methods to deal with the
many objectives long eulogized by curriculum builders.


THE SPECIFICITY OF INSTRUCTIONAL OBJECTIVES


Emphasis on objectives as the springboard for the evaluation of 
educational achievement has come from evidence concerning the 
low correlations between achievements of various objectives. Thus 
the earlier standpoint of Wood (23:163), that high relationships 
were to be found between measures of information and of ability 
to think in a field, has been thrown into doubt. Different results 
have been found by various investigators. 
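
The correlations at issue in these studies are product-moment coefficients between pupils' scores on two kinds of test. A minimal sketch of the computation follows; the two sets of scores are invented and stand for hypothetical "information" and "application" tests taken by the same five pupils.

```python
import math

def pearson_r(x, y):
    """Product-moment correlation between two equally long score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Invented scores for five pupils, listed in the same pupil order.
information = [1, 2, 3, 4, 5]
application = [2, 1, 4, 3, 5]
r = pearson_r(information, application)
```

A coefficient near 1.00 would suggest that the two achievements go together; a coefficient near zero would suggest that they are largely specific.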

The following discussion shows the differences between those 
workers who have found high and low correlations between achieve- 
ments of various objectives of courses. Thus Johnson (8) concluded 
on the basis of correlational studies that the ability to acquire in- 
formation on the part of students in human biology, physics and 
chemistry, and basic wealth courses has been accompanied to a 
substantial extent by the ability to apply this information. Similarly 
Eurich (4) found that the thirteen subtests of a comprehensive examination
designed to measure achievement of thirteen objectives
of an English course were so intercorrelated as to “indicate prac- 
tically no relationship between achievement in terms of some 
objectives while between others it is relatively high.” (Italics ours.) 
Fotos and Remmers (5), working with objectives and examina- 
tions in French courses, obtained high intercorrelations between 
vocabulary, English to French translation, French to English trans- 
lation, and knowledge of verb forms; they conclude that “the lan- 
guage pattern tends to develop as a whole.” 

McConnell (12:3-8) used a more elaborate approach to the 
problem of differential achievement. In three subject-matter divisions 
he analyzed tests intended to measure achievement of three kinds 
of objectives: knowledge of vocabulary, knowledge of facts and 
principles, ability to apply facts and principles. His analysis was 
made not only by the usual method of intercorrelating the three 
tests, which yielded an average correlation of .66, but also by the 
method of item discrimination. That is, did test items for one ob- 
jective discriminate among students of high and low achievement 
better when the criterion of achievement was the total score in its 
own objective, or did these items discriminate better when the 
criterion was total score on a test devised to test some other objective?
He found that differences in item-discriminating power
were greater between subject-matter sections than between the
sections for the three different objectives. This means that the three
objectives differed less among themselves than the three subjects 
differed among themselves. 
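
The logic of comparing an item's discriminating power against two different criteria can be sketched as follows. The index McConnell actually used is not described here, so the simple upper-minus-lower index below is an assumption, and all of the data are invented.

```python
def discrimination_index(item_correct, criterion_scores):
    """Upper-minus-lower discrimination index for one item.

    item_correct: list of 0/1 values, one per student.
    criterion_scores: total scores on the criterion test, in the
    same student order. Students are split into a lower and an upper
    half on the criterion (a middle student is dropped when the
    number of students is odd).
    """
    n = len(item_correct)
    order = sorted(range(n), key=lambda i: criterion_scores[i])
    half = n // 2
    low, high = order[:half], order[n - half:]
    p_high = sum(item_correct[i] for i in high) / half
    p_low = sum(item_correct[i] for i in low) / half
    return p_high - p_low

# Invented data for six students on one hypothetical "application" item.
item = [1, 1, 1, 0, 0, 0]
own_total = [9, 8, 7, 3, 2, 1]      # totals on the application test
other_total = [5, 2, 6, 4, 7, 3]    # totals on the vocabulary test

d_own = discrimination_index(item, own_total)
d_other = discrimination_index(item, other_total)
```

In this invented case the item discriminates far better against its own objective than against the other one. McConnell's finding was that such differences between objectives were smaller than the corresponding differences between subject-matter sections.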

A different group of researches has, however, emphasized the 
specificity of, or low correlation between, achievements of various 
objectives. Tyler's studies (22) in numerous courses at Ohio State 
University showed correlations of only about .40 between the scores 
on recall of information tests, on tests that demanded the application
of principles, and on tests of ability to draw inferences from new
data. He concludes that students did not develop corresponding
degrees of achievement in mere recall and achievement in higher
mental processes of applying principles and drawing inferences.
Similarly Brown (1:59-60) found that “in a college foods class
the correlations between a knowledge of the scientific principles
underlying cookery and the quality of the food cooked, or the ability
of people to manage their work in the laboratory, was less than .50.” 
Other studies, such as those by Amidon, Botto, Jones, Segner, and 
Brown, are also cited by Brown as having furnished evidence in- 
dicating low correlation between achievements of various objectives 
of home economics courses.

These differences in conclusions concerning the specificity of objectives
may be due to various factors. (1) Differences in instructional
emphasis may lead to either a high or a low relationship between
achievement of various objectives. If teachers consistently
strive to relate facts and principles to their applications, and to
relate practice to theory, their pupils may perhaps exhibit higher
relationships between these objectives. (2) Also, as McConnell
(12:7) points out, “Research on the transfer of training suggests
that the relationship between information and application would
change with the difficulty of the application problems, and also with
the degree of similarity between the original learning situations
and those to which the facts and principles were to be applied in
the examination. One might expect, also, to find a lower correlation
between a knowledge of facts and principles tested verbally and
applications tested in performance, than between knowledge and
application when both were measured in purely verbal situations.”
(3) Unreliability of the measuring instruments will also produce 
low correlations. 
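
The effect of unreliability can be made concrete with the attenuation relation of classical test theory, under which the correlation observed between two tests equals the correlation between the true scores multiplied by the square root of the product of the two reliability coefficients. The figures in the sketch below are invented for illustration.

```python
import math

def attenuated_correlation(true_r, reliability_x, reliability_y):
    """Correlation expected between two fallible tests, given the
    correlation between true scores and each test's reliability."""
    return true_r * math.sqrt(reliability_x * reliability_y)

# Invented values: a true correlation of .80 between two achievements,
# measured by tests with reliabilities of .64 and .49.
observed = attenuated_correlation(0.80, 0.64, 0.49)
```

With these assumed values the expected observed correlation is about .45, so that a fairly close underlying relationship would appear as a low one simply because the tests are unreliable.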


WHY TEACHERS MUST FORMULATE OBJECTIVES


Objectives are determined, in the main, by the teachers of specific 
subjects when they construct courses of study. Thus Leary (10)
found that teachers participated in making 63 per cent of the 1660 
curricula she analyzed. Supervisors, principals, and superintendents 
also serve on many curriculum committees, while professors in in- 
stitutions of higher education act as consultants or provide, through 
graduate courses, inspiration or immediate opportunity for practical
curriculum work.

Most teachers operating in present-day schools can derive an 
original set of objectives from the course of study in the subject or 
subjects they are to teach, since only 13 per cent of this representa- 
tive sample of 1660 state, county, and city courses of study contained 
no statement of objectives. However, many teachers will find the 
objectives furnished by courses of study inadequate for one reason 
or another. Irrelevance to social and individual needs, vagueness, 
and ambiguity are perhaps the most frequent faults to be found in 
objectives contained in courses of study. And since objectives are 
necessary to valid instruction, evaluation, and guidance, many 
teachers will have to formulate their own sets of objectives, at least 
in part.


GENERAL AND SPECIFIC OBJECTIVES 


Educational objectives may, of course, be classified in numerous 
ways. Perhaps the most useful distinction at this point is that be- 
tween general and specific objectives. This distinction is not a 
sharp one but rather may be considered to be continuous, allowing 
the possibility of having objectives at any point along a continuum 
from extreme generality to extreme specificity. General objectives 
are those which control the general learning situation, such as “a 
happy and useful life in society” or “the ability to reason scien- 
tifically.” Specific objectives, on the other hand, are the narrower, 
day-to-day goals, such as “the ability to use the possessive singular 
correctly,” or “the ability to make subjects agree with verbs in 
number,” or “the ability to use the tabular key in indenting for
paragraphs,” or the “use of the proper form in introducing one
person to another.” Both of these types are, of course, necessary in
any clear formulation of objectives, since the achievement of general
objectives depends upon the contribution made by each specific
factor in relation to every other specific factor. Fig. 1 gives a graphic
illustration of this interdependence of general and specific objectives.

FIG. 1. Graphic illustration of relationship between general and specific
objectives. [Figure not reproduced; surviving label: General Objective.]

That the teacher is, therefore, often supplied with very inadequate
statements of objectives is evident from the following findings of
Leary (10): “General objectives are included in 76 per cent of all
the courses analyzed. They represent the only kind of objective
mentioned in 46 per cent of the courses. . . . Specific objectives are
formulated in 42 per cent of the total number of courses, and represent
the only kind mentioned in approximately 10 per cent. . . . Both
general and specific objectives occur in 31 per cent of all the courses
analyzed. . . .” On the basis of this analysis by general and specific
objectives, we must again conclude that many teachers must
formulate their own objectives.
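
The percentages quoted above can be checked against one another with a little arithmetic. The sketch below uses only figures given in this chapter, including the 13 per cent of courses reported earlier as containing no statement of objectives; the variable names are ours, and the figures are treated as rounded.

```python
# Percentages of the 1660 courses of study, as quoted from Leary (10).
general_any = 76     # courses that include general objectives
general_only = 46    # general objectives are the only kind mentioned
specific_any = 42    # courses that include specific objectives
specific_only = 10   # "approximately 10 per cent"
both = 31            # both general and specific objectives occur
none = 13            # no statement of objectives at all

# Courses with general objectives = general only + both kinds.
implied_general = general_only + both
# Courses with specific objectives = specific only + both kinds.
implied_specific = specific_only + both
# The four mutually exclusive groups should account for all courses.
accounted = general_only + specific_only + both + none
```

The implied totals (77 and 41) agree with the reported 76 and 42 within rounding, and the four groups together account for all of the courses. On these figures only about three courses in ten supplied the teacher with both general and specific objectives.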



It may be argued that excellent teaching practices can exist with- 
out the conscious and explicit formulation of objectives. Thus the 
work of the interschool committees for evaluating the Eight-Year 
Study of the Progressive Education Association indicated that 
“vagueness of statement does not necessarily mean vagueness of 
purpose” (19). However, the clarification of objectives is necessary 
to the evaluation of teaching practices and pupil achievement, even 
when it is not essential to instruction. And that such clarification
will usually assist in teaching toward the objectives is also extremely 
probable by virtue of the directive influence which objectives exert 
upon the selection of content, activities, and experiences.


HOW TO FORMULATE OBJECTIVES


In order to assist teachers in formulating objectives for their own 
courses of study we shall here present, in order, (1) a discussion of 
the desirable form for statements of objectives, (2) a discussion of 
the desirable content of such statements, (3) illustrations of objectives
both for education in general and for special subjects.

Form for Stating Objectives.—First, the objective should be 
worded in terms of changes expected in the pupil rather than as 
duties of the teacher, since attainment of objectives must in any 
case be evaluated in terms of pupil changes. The difference between 
the two ways of stating objectives may be seen from the following 
illustration.

Teacher's objective: To teach the knowledges, skills, habits, and attitudes
    involving the multiplication table in arithmetic.
Pupil objectives:
    Knowledge: To know the multiplication table.
    Skill: To be able to apply the multiplication table to problem-solving.
    Habit: To check carefully all answers derived from applications
        of the multiplication table.
    Attitude: To respect the power and value of the multiplication
        table.

Secondly, an objective should be in terms of observable changes in 
the pupil between the beginning and end of his experiences in a defined
segment of the educative process. Unless we can observe, by
one of the various evaluation techniques, whether pupils are 
changed, we shall have difficulty in justifying an objective, however 
worthy it may appear on philosophical grounds. 

Thirdly, the terminology of the objective should be understandable 
and not vague; it should have its meaning defined if necessary in 
terms that pupils, parents, and other teachers can appreciate. This 
requirement, it is readily understood, can often be met only after 
much thought, discussion, questioning, wording, and rewording. 

Fourthly, each statement should be unitary and contain one ob- 
jective only, in order to prevent confusion and to facilitate ready 
identification of the objective. To illustrate, such an objective as 
"To be able to translate French into English and English into 
French, with correct use of idioms" is less clear than the statement 
of these objectives in the form of three separate objectives. 

Fifthly, objectives should be grouped for purposes of economy and 
clarity, and for use in guiding pupil activities, in the organization of 
units of work, and in the construction of evaluation devices. That is, 
specific objectives should be grouped under the objective that is 
general to them. For example, character traits which the school 
consciously aims to develop may be the headings for groupings of 
qualities exemplified in an individual possessing the trait, as fol- 
lows: "Industry—Persistent, steady, thorough, conscientious in work, 
wise in use of time. Cooperation—Obedient to accepted authority, 
willing to assume individual responsibility, willing to sacrifice individual
desires for group interest, respectful of rules, etc.”

Content of Statements of Objectives.—First, statements should 
contain only the real objectives of the course, and not those to 
which lip service only is given. Teachers may be tempted to formu- 
late a glib statement rather than to think through their own serious 
purposes. Hence teachers should ask themselves whether they intend
actually to do something relevant about each objective and about 
each phrase in the statement of the objective. 

Secondly, the list of objectives should contain references not only 
to the subject matter of a course but also to the mental processes to 
be applied to the subject matter. By mental processes are meant such 
types of behavior as remembering, reasoning, appreciating, being 
interested, and so on. And, especially with reference to the higher
mental processes, statements of objectives should be comprehensive.
The higher mental processes are those to which the individual makes 
a large contribution through his own conscious effort; such mental
processes as sensation or mere memory are considered to be lower in
the intellectual hierarchy because the individual makes less con- 
tribution: he is more receptive than when he compares, infers, and 
abstracts. The value of the higher mental processes is largely a 
social one. As Judd (9 :4) says, “If by any means the educational 
system can discover how to promote even in the slightest measure 
the development of the higher mental processes, great advantage 
will be gained for civilization.” 

Differences of opinion exist among evaluation workers concerning
the relative merits of various “mental process” objectives. Tyler
(20 : 15) and his co-workers include and emphasize in their lists of 
objectives such types as “reasoning, or scientific method, which in- 
cludes induction, testing hypotheses, and deduction.” These types 
have been derived from the demands of instructors in high school 
and college courses for tests which measure the attainment of the 
major outcomes they seek to achieve in their students. Wood and
Beers (24), on the other hand, disparage the hope of teachers to 
“create thinkers,” and uphold instead the social value of the knowl- 
edge of facts. The evidence adduced by Tyler (21) concerning 
(1) the relatively low retention value of ability to recall information 
and the high retention value of such outcomes as ability to recall 
principles, to apply principles, and to draw inferences from new 
data, and (2) the low correlations (9:6-17) (r equals about .40)
between these types of achievement, is interpreted in different ways
by these two schools of thought. On the one hand these findings
are interpreted as showing that these tests of “higher mental processes”
really measure activities which are independent of the educative
process; thus “the relatively slight loss indicated by [these]
tests given at intervals after completion of a course reflects the
relative powerlessness of the teacher to alter the thinking ability of
students” (24:494). On the other hand these findings are interpreted
as showing that these “higher outcomes” are more permanent
results of teaching and consequently more worth striving for and
evaluating.

Whatever may be the true interpretation, the classroom teacher
will profit from the consideration and formulation of various mental 
process objectives as well as of the various subject-matter objectives 
into which course content may be analyzed. Such consideration, 
even if it does not improve thinking ability, will lead to more vital, 
interesting teaching and presentation of subject matter, in more 
varied forms and in forms more recognizable by pupils as related 
to their life needs. It will also lead to more effective construction of 
teacher-made evaluation devices. 

Thirdly, the content of statements of objectives should be determined
by community and individual needs. Such needs can be discovered
by investigations, studies, and reports of the activities and
interests of the pupils involved and of the social milieu in which 
they will have to function. Detailed description of these methods of 
curriculum-making would be out of place here. Many excellent 
works (2, 3) have been devoted to this problem and references to 
them may here serve the purposes of our discussion.


ILLUSTRATIONS OF OBJECTIVES

Before considering objectives for specific subjects it will be valuable 
to examine explicit attempts which have been made to formulate 
lists of general objectives for the work of the schools. Such formula- 
tions are, of course, the proper tasks of educational philosophers, ad- 
ministrators, and curriculum builders. The views of such thinkers 
reflecting the social forces of their environment have largely determined
the content of the teaching that goes on in the schools. But
formulations of educational objectives have also been made by those 
concerned chiefly with the evaluation of pupil achievement. One 
such formulation, derived from work in constructing achievement 
tests for several departments of Ohio State University, has been 
presented by Tyler (18): 


Type A. Information, which includes terminology, specific facts and
general principles.

Type B. Reasoning, or scientific method, which includes induction, testing
hypotheses, and deduction.

Type C. Location of relevant data, which involves a knowledge of
sources of usable data and skill in getting information from
appropriate sources.

Type D. Skills characteristic of particular subjects, which include laboratory
skills in the sciences, language skills, and the like.

Type E. Standards of technical performance, which include the knowledge
of appropriate standards, ability to evaluate the relative
importance of several standards which apply, and skill or habits
in applying these standards.

Type F. Reports, which include the necessary skill in reporting projects
in engineering or reporting experiments in science and the like.

Type G. Consistency in application of point of view, which is most ap- 
parent in philosophy. 

Type H. Character, which is perhaps the most inclusive, involving many
specific factors. 


Another list of objectives is that which resulted from a classifica- 
tion of the wide variety of objectives submitted by the thirty sec- 
ondary schools participating in the Eight-Year Experimental Study 
of the Progressive Education Association. In order to observe the 
effects of a different type of instruction, these thirty schools were 
freed of the influence of college entrance requirements upon their 
work by the colleges agreeing to accept all recommended candi- 
dates. Raths (15) was able to classify all of the objectives submitted 
by these schools under the following eight divisions: 


1. Functional information
2. Aspects of thinking
3. Attitudes
4. Interests, aims, purposes
5. Study skills and work habits
6. Social adjustment and social sensitivity
7. Creativeness
8. A functional social philosophy


These attempts of evaluation workers to formulate lists of ob- 
jectives have, of course, both determined and reflected the actual 
content of courses of study now operating in schools. Leary’s
analysis (10) of the objectives of 1660 courses of study thus shows
the percentages of occurrence of various objectives as presented in
Table 1.

TABLE 1. Percentage of Occurrence of Objectives Interpreted as
Modifications of Pupil Behavior

                                                             Total Percentage
Objectives                                                   in State, City and
                                                             County Courses

1. Attainment of knowledge and acquisition of skills
   in particular fields .................................... 68
2. Development of desirable attitudes, appreciations,
   and understandings ...................................... 66
3. Development of specific habits and abilities ............ 45
4. Promotion of enriched living and social well-being ...... 27
5. Development of personality .............................. 21


DISTINCTION BETWEEN MEASUREMENT AND EVALUATION


The comprehensiveness of these lists of general objectives, as con- 
trasted with the narrowness of the objectives the achievement of 
which is measured even today by most educational tests used in
schools, has been interpreted as a reflection of the increasing accept- 
ance of the “unitary,” “organismic,” or “integrative” view of hu- 
man behavior, as opposed to the “atomistic,” “mechanical,” or “addi- 
tive” view (11, 16). The organismic view, promulgated by such 
researchers as Jennings, Child, and Coghill in biology and
Wheeler, Lashley, and the Gestaltists in psychology, holds that those 
descriptions of behavior which consist of merely piecing together the 
descriptions of numerous separate performances are inadequate. 
Rather, behavior must be appraised more nearly in terms of its 
totality, as a whole; and with reference to the values of society as 
these have been or can be formulated. As a compromise between 
this ideal and the sin of acting as if we had measured all of a pupil 
when we have interpreted the results of a few tests, there has arisen 
the movement toward more comprehensive measures and more valid
interpretations of the results of our partial measures. It is this felt
need for comprehensiveness that has caused the shift from the term
“measurement,” which implies mathematically precise mensuration of
acquired knowledge, to the term “evaluation,” which widens the
areas to be studied to include subjective opinions and qualitative
changes as well as objective and quantitative changes, to include 
changes in attitudes, appreciations, and understandings as well as 
acquisitions of knowledge and skills. The total personality of each 
child operating in the school and community must be observed in 
its relation to the impinging educative experiences. Consequently, 
educational evaluation has constructed comprehensive lists of edu- 
cational objectives toward which pupils should be guided and taught. 


SPECIFIC OBJECTIVES 


From the consideration of general objectives we may now proceed 
to formulations of objectives within specific subject matters. The 
comprehensive, organismic, unitary conception of educational ob- 
jectives must, of course, be carried over into each of these bodies of 
subject matter. On the one hand is the subject matter, as determined 
by textbook content, courses of study, definitions of requirements. 
On the other hand are the mental processes that can operate upon 
that subject matter, the outcomes of learning that a given teacher 
will seek to bring about in his pupils with respect to their behavior
in that particular subject-matter field. The present consideration of 
objectives in terms of specific subjects rather than in terms of inte- 
grated, unified school experiences is perhaps not in accord with re- 
cent trends in curricular theory. But its conformity to actual practice 
is shown by the fact that the overwhelming majority (88 per cent) of 
courses of study are organized by single subjects or groups of unrelated
subjects (10: 25). Consequently our illustrations of objectives
for particular teachers will be in terms of specific subjects. It is obvi- 
ously impossible for reasons of space and emphasis to present here 
the formulations of objectives for more than a few subjects. These 
will serve as illustrations of the results of this step in the construction 
of evaluation devices. With these illustrations and with further suggestions
of sources for subject-matter objectives, the present discussion
should furnish classroom teachers with equipment for formulating
objectives to fit their particular situation.

At the elementary school level, the subject matter may be classified 
into four divisions: 

1. Language arts—Handwriting, spelling, composition (written
   and oral), grammar, reading
2. Arithmetic—Computation, problem-solving
3. Social studies—Geography, history, civics
4. Others—Art, music, science, hygiene, home economics, industrial
   arts, etc.


To illustrate the formulation of objectives at the elementary level, we 
shall present representative attempts in two of these subjects—read- 
ing and arithmetic—in specific school grades. 


OBJECTIVES OF INSTRUCTION IN READING 


Grade I 


A. Training in reading readiness, which involves (6: 148-64) acquisition
of the (1) intelligence, (2) visual and auditory perception,
(3) language development, (4) background of experience,
and (5) social behavior necessary before a child can
learn to read. Most of these objectives cannot be aimed at
explicitly by the teacher, since their attainment is largely a
function of certain processes of maturation which we have
not yet learned how to affect with specific variations in environment,
or which, like home background, cannot be readily
changed for instructional purposes.


Grade III 
A. Know the sounds of all the letters of the alphabet and of most 
of the common phonograms. 
B. Be able to work out the pronunciation of unfamiliar words 
without help. 
C. Read silently with greater speed than is attained in reading
orally.
D. Read silently without pointing or lip movements. 
E. Like to read, and read widely in varied sources. 
F. Be able to read fourth-grade material with satisfactory under- 
standing. 

Grade VI
A. Read widely so as to extend and enrich experience.
B. Continue improvement of basic skills in word recognition and
comprehension.
C. Read silently with much greater speed than is attained in reading
orally.
D. Acquire training in the use of books and in the location of information
in dictionaries and other reference works.
E. Acquire skills involved in the reading of factual material.
F. Achieve seventh-grade level in reading ability.


OBJECTIVES OF INSTRUCTION IN ARITHMETIC 


In general, the objectives may be stated in two classes: computa- 
tion, as represented by the sixteen combinations possible between the 
four kinds of operations (addition, subtraction, multiplication, di- 
vision) and the four kinds of numbers (whole numbers, mixed num- 
bers, common fractions, decimal fractions); and problem-solving, 
which may be analyzed into five somewhat distinct abilities: (1) 
comprehension, (2) ability to determine what is given, (3) ability to 
determine what is called for, (4) ability to conceive the probable an- 
swer, and (5) ability to produce the correct solution. 
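
The count of sixteen computation objectives follows from pairing each of the four operations with each of the four kinds of numbers. As a minimal sketch, the pairs can be enumerated in Python; the names `OPERATIONS`, `NUMBER_KINDS`, and `computation_objectives` are illustrative assumptions, not terms from the text.

```python
from itertools import product

# The four operations and four kinds of numbers named in the text.
OPERATIONS = ["addition", "subtraction", "multiplication", "division"]
NUMBER_KINDS = [
    "whole numbers",
    "mixed numbers",
    "common fractions",
    "decimal fractions",
]


def computation_objectives():
    """Return every (operation, kind of number) pair, one per objective."""
    return list(product(OPERATIONS, NUMBER_KINDS))


if __name__ == "__main__":
    # Print one line per computation objective, e.g. "addition of whole numbers".
    for operation, kind in computation_objectives():
        print(f"{operation} of {kind}")
```

A teacher drawing up a computation test could use such a listing as a checklist to confirm that each combination is represented by at least one item.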

In selecting the teaching materials in arithmetic instruction the 
trend has been away from basing the selection on disciplinary value 
for the mind to basing it on the social usefulness of the materials. 
Also, demonstrations of the absence of correlation between arithmetic 
abilities with apparently similar materials, such as ability to add col- 
umns of two numbers and ability to add columns of three numbers, 
have caused curriculum makers to analyze arithmetic objectives in 
great detail. For example, Merton (13) listed eighteen distinct objectives
in teaching the addition of whole numbers.

At the secondary school level, the subject matter may be classified 
into eight divisions: 

1. English—Composition and literature
2. Mathematics—Algebra, geometry, trigonometry
3. Foreign languages—Latin, French, German, Spanish, Italian
4. Social studies—Civics, American and world history, economics,
   sociology, social psychology
5. Natural sciences—General science, biology, chemistry, physics
6. Commercial subjects—Bookkeeping, shorthand, typewriting,
   business practice
7. Home economics—Clothing, foods, housing, family relationships
8. Industrial arts—Mechanical drawing, shop work (metal,
   wood, etc.), related mathematics, related science, blueprinting,
   etc.


To illustrate the formulation of objectives at the secondary level, 
we shall present representative attempts in two of the subjects. In 
English, the objectives of instruction have been stated as follows
(7: 381):


A. Literature or assimilative objectives
   1. Attainment of ability to read literary materials with facility
      and understanding
   2. Development of critical judgment and appreciation of literature
   3. Enlarged acquaintance with literature and with literary history
   4. Broadening of experience vicariously through reading
   5. Formation of desirable attitudes toward reading, namely,
      increased appetite and taste for what is good in literature
   6. Attainment of some competency in the use of the resources of
      libraries
B. Language or reproductive objectives
   1. Attainment of ability to speak and write intelligibly, agreeably,
      and effectively
      a. Skill in correct language usage
      b. Power of expression
   2. Development of desirable attitudes toward the translation of
      experience into spoken and written words.


In mathematics, the objectives of instruction have been formulated 
as follows (17):
A. Utilitarian aims
   1. Skill in the fundamental processes of arithmetic
   2. Command of the language of algebra
   3. Such knowledge of the fundamental laws of algebra as will
      equip one to understand and use elementary algebraic methods
   4. Skill in interpreting graphical representation
   5. Familiarity with the geometric forms common in nature,
      industry, and life. This involves acquaintance with the more
      fundamental properties and relationships of these forms.
B. Disciplinary aims
   1. The acquisition, in precise form, of those ideas and concepts
      in terms of which the quantitative thinking of the world is
      done, and the development of ability to think clearly in terms
      of such ideas and concepts
   2. The acquisition of mental habits and attitudes which will
      make the above training effective in the life of the individual:
      a seeking for relations, an attitude of inquiry, a love of precision,
      a desire for orderly and logical organization, etc.
   3. Training in thinking in terms of the idea of relationship of
      dependence
C. Cultural aims
   1. Appreciation of beauty in the geometric forms of nature, art,
      and industry
   2. Ideals of perfection as to logical structure, precision of statement
      and of thought, logical reasoning, etc.
   3. Appreciation of the power of mathematics.

These objectives of mathematics instruction differ among them- 
selves of course in their “immediacy,” those of Group A, Utilitarian, 
being far more directly associated with the actual subject-matter con- 
tent of courses of study than those of Groups B or C. This greater 
immediacy and ease of realization should not however decrease the 
teacher's efforts to realize and aim toward the other objectives. 




SUMMARY


Six aspects of pupils with which evaluation for guidance should 
be concerned are listed. The first of these, achievement of instruc- 
tional objectives, is considered in detail in this chapter. Objectives 
are defined and related to evaluation. The conflicting evidence con- 
cerning the specificity of achievement is considered and its implica- 
tions for evaluation are discussed. Teachers should usually formulate 
objectives because of the inadequacy of most statements of objectives 
furnished them. The distinction between general and specific objectives
is not a sharp one but should be made to facilitate the comprehension
of objectives. Explicit statements of objectives are even more
essential to evaluation than to instruction. Objectives should be
stated in the form of observable pupil changes, understandably and
singly, and should be grouped. Statements of objectives should contain
only actual guiding purposes, should include references to mental
processes, and should be determined by individual and social
needs. Illustrations are given of statements of objectives for instruc- 
tion in general, for reading and arithmetic at the elementary level, 
and for English and mathematics at the secondary level. The dis- 
tinction between measurement and evaluation is drawn in terms of 
the latter’s greater attention to social significance and of its greater 
emphasis on comprehensiveness and continuity. 


QUESTIONS 


1. Distinguish between achievement of instructional objectives and such
aspects of pupils as high mental ability, cultured family background,
or preference for artistic rather than scientific activities.

2. Contrast the implications for achievement evaluation procedures of
the generality and specificity theories of achievement.

3. Illustrate in terms of specific pupil behaviors and subject matters such
different mental processes as ability to infer, ability to interpret, ability
to recall, ability to recognize, desirable habits, and desirable attitudes.

4. On what type of objective has most evaluational emphasis been placed 
in the past? For what reasons? With what desirable or undesirable 
results? 

5. Prepare a set of instructional objectives for a particular subject or unit 
at a particular grade level. 

6. Examine your own experience for instances where your learning efforts
have been evaluated on a superficial or otherwise inadequate
basis. How would improved evaluation have affected your education? 


REFERENCES


1. Brown, Clara M., Evaluation and Investigation in Home Economics, 
New York: F. S. Crofts & Co., 1941. 

2. Caswell, H. L., and Campbell, D. S., Curriculum Development, 
New York: American Book Company, 1935. 

3. Draper, E. M., Principles and Techniques of Curriculum Making, 
New York: D. Appleton-Century Company, Inc., 1936. 

4. Eurich, A. C., “Measuring the achievement of objectives in freshman
English,” Studies in College Examinations, Minneapolis: University
of Minnesota, Committee on Educational Research, 1934, pp. 50-66.

5. Fotos, J. T., and Remmers, H. H., “The functional interrelationships
of certain aspects of modern language learning,” Modern Language
Journal, 18:481-493 (1934).

6. Harris, A. J., How to Increase Reading Ability, New York: Longmans,
Green and Company, 1940.

7. Hawkes, H. E., Lindquist, E. F., and Mann, C. R., Construction and
Use of Achievement Examinations, Boston: Houghton Mifflin Company,
1936.

8. Johnson, P. O., “Differential functions of examinations,” Studies in
College Examinations, Minneapolis: University of Minnesota, Committee
on Educational Research, 1934, pp. 43-50.

9. Judd, C. H., Education as Cultivation of the Higher Mental Processes,
New York: The Macmillan Company, 1936.

10. Leary, B. E., A Survey of Courses of Study and Other Curriculum
Materials Published Since 1934, U. S. Department of the Interior,
Office of Education, Bulletin 1937, No. 31, Washington: Government
Printing Office, 1938.

11. Mathews, C. O., “Issues in the construction and use of educational
measurements,” Journal of Educational Research, 33:452-456
(1940).

12. McConnell, T. R., “A study of the extent of measurement of differential
objectives of instruction,” in An Appraisal of Techniques of
Evaluation, Symposium, American Educational Research Association,
National Education Association, Washington, D.C., Feb. 26,
1940.

13. Merton, C. L., “Remedial work in arithmetic,” Second Year Book of
Department of Elementary School Principals, 1933.

14. Peters, C. C., “The relation of standardized tests to educational objectives,”
Second Yearbook of the National Society for the Study of
Educational Sociology, New York: Teachers College, Columbia University,
1929.

15. Raths, L. E., “Evaluating the program of a school,” Educational Research
Bulletin, 17:57-84 (1938).

16. Smith, B. O., Logical Aspects of Educational Measurement, New
York: Columbia University Press, 1938.

17. The Reorganization of Mathematics in Secondary Education, Mathematical
Association of America, Inc., 1923.

18. Tyler, R. W., Constructing Achievement Tests, Columbus: Ohio
State University, 1934. (Reprints from Educational Research Bulletin.)

19. Tyler, R. W., “Defining and measuring objectives of progressive
education,” Educational Record, Supplement No. 9, pp. 78-85, 1936.

20. Tyler, R. W., “Evaluation: A challenge to progressive education,”
Educational Research Bulletin, Volume 14, 1935.

21. Tyler, R. W., “Permanence of learning,” Journal of Higher Education,
4:203-204 (1933).

22. Tyler, R. W., “The relation between recall and higher mental processes,”
in Judd, C. H., Education as Cultivation of the Higher Mental
Processes, New York: The Macmillan Company, 1936, pp. 6-17.

23. Wood, B. D., Measurement in Higher Education, Yonkers: World
Book Company, 1923.

24. Wood, B. D., and Beers, F. S., “Knowledge versus thinking,”
Teachers College Record, 37:487-499 (1936).




CHAPTER III 




Physical Aspects of Pupils 


In this chapter we shall continue our enumeration and description
of the aspects of pupils which should be evaluated as data for
guidance. Also, since the material of this chapter represents a departure
from what has traditionally been included in books on educational
tests and measurements, we shall present some of the reasoning
which has led to its inclusion and a discussion of how this
work of the teacher should fit in with that of other school health
agencies.


REASONS FOR TEACHERS’ CONCERN WITH PUPILS’ PHYSICAL ASPECTS


Why should classroom teachers be concerned with the evaluation 
of the physical aspects of their pupils? Justification for such concern
has traditionally been given in terms of the connection between
mind and body. The human body has been considered a machine 
through which the mind works. Mental activity has been recog- 
nized as inseparable from physical activity. Consequently it has been 
assumed that the quality and quantity of mental work depend as 
certainly upon the condition of the bodily machine as do the quality 
and quantity of work produced by any less complex, man-made ma- 
chine. Scientific research, however, has modified greatly our notions 
concerning this relationship. Although this research has been going 
on since the beginning of this century, it is not yet widely known 
or appreciated. The research findings which contradict common no- 
tions concerning the relationship between mind and body have been 
summarized by Paterson (8: 269-270) at the end of his exhaustive 
survey of the data, as follows: “. . . prevalent notions regarding the 
intimacy of the relationship between physical traits and intellect
have been greatly exaggerated. Search in the realm of gross anatomy
for a physical correlate of intellect has yielded uniformly negative
results. . . . Such structural characteristics as height and weight,
. . . head size and shape, . . . skeletal development measured by
precise X-ray photography, . . . dentition, . . . physiological development
measured in terms of pubescence, . . . complicated morphological
indices of body build, . . . deleterious physical conditions,
. . . glandular therapy, . . . disease processes [except when they]
directly attack the central nervous system, especially the higher
centers . . .” have all been found to be relatively unrelated to mental
development and temperament.

However, although “body” and “mind” are not related in the
ways discussed above, there is still ample justification for the teacher’s
concern with pupils’ physical aspects. Physical condition is obviously


related to happiness, and consequently good health is something to 
be desired and achieved for its own sake. Also physical condition is 
directly related to the working efficiency of pupils and of workers 
in the economic world. Poor health causes absence from school and 
from work; it can decrease the amount of energy available for the 
learning process, for participation in social relationships, for the main- 
tenance of emotional stability, and for all the other forms of pupil 
development. 

Similarly, the physical aspects of pupils are directly related to their
vocational fitness. Since different occupations require different physical
capacities of those who work in them, realistic guidance must
take the physical capacities of pupils into account. Numerous instances
of physical qualifications for occupations come to mind. The
almost perfect physical equipment required for admission to the
United States Military and Naval Academies; the perfect vision
required of airplane pilots; the durable vision required of lawyers,
accountants, draftsmen; the physical and nervous stamina required of
medical students and physicians; the pleasant, presentable physiognomy
and general appearance required of those engaged in occupations
involving direct, face-to-face contacts with other people, such
as selling; the strength and endurance required of many industrial
workers: all these are instances of physical requirements for occupations
toward which many high school students have aspirations.

Teachers can introduce a distinctly realistic note into their occupational
counseling of pupils by calling attention to the physical requirements
for occupations and relating the pupils’ own physical
status to them. 

Educational guidance, also, should proceed in terms of the findings 
from teachers' evaluations of the physical aspects of their pupils. Ob- 
viously a student's physical capacity determines his ability to acquire 
the education necessary for any occupational goal. The kind of cur- 
riculum in which a student may be advised to enroll is determined 
by the same sort of reasoning from curriculum requirements to his 
physical capacity as was described in our remarks on the relationship 
between physical status and occupational goal. The amount of school 
work and other activity in which a pupil should be encouraged to 
engage may similarly be determined by his physical status. A listing
of the kinds of adjustments which may be made to health needs has
been made by Brown (1: 40):


1. Shortening of academic program 

2. Curtailment of outside activity

3. Withholding approval of athletics or permitting participation in 
accordance with individual needs 

4. Use of a convalescence room for students needing rest and re- 
laxation 

5. Lengthening the noon hour 

6. Mid-morning lunch of milk and orange juice recommended 

7. Correction of physical defects by outside medical aid 

8. Systematic health inspection and cooperation with health agen- 
cies 

9. Classroom adjustment for the physically handicapped. Allot- 
ment of front seats for the hard-of-hearing pupils and provision 
of special lighting for defective vision 

10. Provision of corrective physical exercises for students who lack
some muscular control, exhibit nervous movements and are 
slightly crippled 


OVERVIEW OF PHYSICAL ASPECTS OF PUPILS


What do we mean by the physical aspects of pupils? These aspects 
have been listed in convenient form in a monograph by Rogers (10). 
His listing will be given here with notes concerning the frequency 
of occurrence of certain physical defects. 




1. Growth.—Normal development in height and weight is a fun- 
damental indicator of normal health. The satisfactoriness of such de- 
velopment can, of course, be judged only roughly. Pupils must not 
be expected to conform exactly to the standards set in height-weight- 
age tables since differences of race and heredity will make pupils dif- 
fer in these respects, even if they are placed under the same condi- 
tions of living. A short stocky boy may be perfectly healthy and 
normal as to growth but in terms of the height-weight-age table be 
"too short" for his weight and age; and conversely for a tall slender 
boy or girl. Also, growth is not always regular; but if a pupil fails 
to gain weight over a period of two months, one should make sure 
there is nothing wrong. That is, the fulfillment of certain require- 
ments of height and weight is not in itself an indication of health. 
Size and weight measures are thus necessary but not sufficient indications.
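
The two-month rule of thumb above can be applied to a pupil's weight record in a few lines. The following Python sketch is only illustrative: the function name `fails_to_gain`, the record format (one weight in pounds per month, oldest first), and the sample figures are assumptions made for the example, not procedures given by Rogers.

```python
def fails_to_gain(monthly_weights, months=2):
    """Return True when the latest weight shows no gain over `months` months.

    `monthly_weights` lists one weight per month, oldest first
    (a hypothetical record format chosen for this sketch).
    """
    if len(monthly_weights) <= months:
        return False  # too few records to cover the period
    # Compare the latest weighing with the one taken `months` months earlier.
    return monthly_weights[-1] <= monthly_weights[-1 - months]


if __name__ == "__main__":
    # Invented sample records, in pounds.
    records = {
        "pupil A": [70.0, 71.0, 72.5],
        "pupil B": [64.0, 64.0, 63.5],
    }
    for name, weights in records.items():
        if fails_to_gain(weights):
            print(f"{name}: no gain in two months; make sure nothing is wrong")
```

A flag from such a check is only a prompt for further inquiry, since, as noted above, growth is not always regular.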

2. Carriage.—The customary position of the head and shoulders 
and the general position in which the child places his body when 
standing or sitting may be interpreted by the teacher as an indica- 
tion of the state of general health. When not due to inheritance or 
to some bony deformity, bad posture is to be taken as a sign of 
fatigue or general weakness resulting from wrong feeding or some 
other condition of poor hygiene. 

3. The Skin and Hair.—Uncleanliness of the skin and hair is im- 
portant for guidance evaluation both for aesthetic reasons, in that 
it may affect the social adjustment of pupils, and for reasons of 
practical efficiency. Its importance in the latter respect may be judged 
from the following quotation: “The worst problem in health counseling
(in secondary schools) is that of skin disease, particularly
scabies, which causes more continued absences than any other disease”
(5: 88). Apart from the cleanliness of the skin and hair and
the presence of such skin diseases as ringworm, impetigo, and
scabies, the teacher should attend to the general appearance and
color of the pupil’s skin. Cold or purplish appearance of the arms,
hands, and lips suggests that something may be wrong with the
organs of circulation, that the pupil is not properly fed, or that he
is insufficiently clothed.

4. The Eyes.—Even today, perhaps, there are teachers who forget
the importance of the pupil’s eyes. There is no excuse for a teacher’s
ignorance of whether a pupil is handicapped by defective eyesight.
Failure in school work, social maladjustment, and emotional in- 
stability can often be traced to visual defects. Nearsightedness, far- 
sightedness, astigmatism, squint-eyes, and cross-eyes are the major 
types of visual defect for which teachers should be on the alert. 

The size of the problem of student vision is indicated by the find- 
ing of Dr. Ruth Boynton, Director of the Student Health Service at 
the University of Minnesota, that approximately 33 per cent of 
students entering the University of Minnesota in the fall of 1936-37 
were found to have visual errors of sufficient severity so that they 
should have worn glasses (11: 516). Data for elementary and secondary
schools indicate a similar frequency of visual defects (9: 6).

5. The Ears.—Hearing is only slightly less important than vision. 
A considerable number of students have impaired hearing, and more 
are thus handicapped than we usually suppose. Boynton’s data, cited 
above, show that 5.4 per cent of her university students had impair- 
ment of hearing in both ears of such grade that it probably was some 
handicap. Slowness, dullness, or mistakes in a child’s schoolroom 
responses should make one suspect poor hearing; a special test 
seems almost indispensable. Subnormal hearing may be detected 
with either an audiometer, a watch, or the voice test. Actual ear 
disease is indicated by the presence of a discharge of light yellow 
material in the ear which may readily be seen by looking into it. 

6. The Nose.—Every pupil ought to be able, except when afflicted 
with a cold or hay fever, to breathe freely through his nose. Defec- 
tive breathing was found in 6 per cent of children in New York 
State, in 13 per cent of rural Indiana boys (Porter County Survey), 
and in 4 per cent of Detroit children (9 : 15). Although the condi- 
tion tends in time to improve, and practically disappears among 
adults, while it exists it is often a menace to health and to the in- 
tegrity of hearing. 

7. The Teeth.—While the decay of the first teeth is not a normal
condition (though it is so common as to seem so), already at six
or seven years of age uncared-for children usually have several
decayed teeth per child, mostly of this first set, and treatment is
well-nigh out of the question. The second set of teeth, however,
begins to appear at this time and should be preserved if possible.
The first of these “permanent” teeth appear just back of the
molars of the first set and often begin to decay within a few
months. Tooth defects are the most frequent kind of defect in
school children, occurring in about 80 to 90 per cent of them. Yet
the dental profession claims that by recently developed methods
practically all of the permanent set can be saved. By early detection
and treatment, probably 90 per cent of the caries, which was believed
unavoidable and which we have been trying to prevent by
brushing, is easily controlled (3).

8. The Throat.—The condition of the tongue, the soft palate, the
pharynx, and the tonsils is extremely important and not difficult for
teachers to evaluate. Boynton found that about 22 per cent of
University of Minnesota students had tonsils that were enlarged or
gave evidence of being infected. The figure for the University of
California is similar, 30 per cent. “The examiners of nearly 600,000
school children in New York State in 1923-24 reported 16 per cent
as having diseased or hypertrophied tonsils” (9: 13).

9. The Breath.—Rogers quotes a distinguished physiologist as
saying that “bad breath has caused more misery than all the bad
laws ever enacted” (10: 16). The teacher who finds that a pupil
has from day to day a breath with a foul odor should do what he
can in a tactful way to see that the cause, whether in the mouth,
nose, or alimentary organs, is found and removed.

10. The Neck.—The teacher should be concerned mainly with
whether the pupil has a wry neck, in which the head is drawn
toward one side and the face turns toward another; or whether he
has enlarged lymph glands which become visible as lumps on the
side of the neck at the edge of the band-like muscles which descend
obliquely on either side from just behind the ear to the upper
portion of the breastbone and the collar bone. Goiter is another
malady for which teachers should be on the alert.

11. The Chest.—Here the teacher should note the existence of
obvious deformities, ability to breathe deeply, expansion of the chest
equally on both sides, normal speed in ordinary breathing, or the
presence of a chronic cough. Breathing abnormally fast on slight
exertion and a purplish color of lips or hands both may indicate
a poorly working heart and hence should be referred to a physician.

12. The Back.—A stoop, an angular projection of the back
(“gibbus” or “hunchback”), unequal height of shoulders, and unequally
outstanding shoulder blades are the main features which
teachers should note here. In the Cleveland schools in one year
0.5 of 1 per cent of the pupils in the first grade showed this condition.

13. The Legs.—The presence of a limp or of evident deformity 
should be noticed, as well as the shortening of one leg to a slight or 
large degree. 

14. The Feet.—Deformities of any kind, such as flat feet and 
calluses, and the shape, size, and condition of the shoes and stock- 
ings should be noted. Foot defects were found in 14 per cent of the 
boys and 18 per cent of the girls in the seventh grade in Cleveland. 

15. Clothing.—Neatness, cleanliness, suitability to temperature, 
and proper use of overshoes should be noticed by the classroom 
teacher.¹

16. Communicable Diseases.—Teachers can easily learn to recognize
the general signs of sickness in pupils and the general signs
and symptoms of such common school-age diseases as the following: 
measles, scarlet fever, diphtheria, tonsillitis, smallpox, chickenpox, 
mumps, German measles, and whooping cough. The periods of 
communicability and the incubation periods or the time required 
after exposure for the disease to develop are also valuable facts for 
teachers. 

¹ This list includes only those physical aspects for whose evaluation teachers are
readily qualified. A more extensive treatment, intended for the next higher level,
that is, for physical educators and school nurses, is given in G. G. Deaver,
Fundamentals of Physical Examination, Philadelphia: W. B. Saunders Company, 1939.

A general picture of the relative frequencies of various physical
defects among school children and college students may be obtained
from Fig. 2. It will be noted that among draftees in 1917-18 the
frequencies were much lower than those among the other two
groups because of the lower military standards as to what constitutes
a physical defect.

[Figure: chart titled “Defectives by School, College, and Draft Standards,”
with columns for school age, college, and draft age. Defects labeled: dental
(800-900), hyp. tonsils, flat feet, visual defects, underweight, nasal
obstruction, org. heart def., tuberculosis, hearing, hernia.]

Fig. 2.—Graphic representation of frequencies per 1000 cases of various
physical defects in individuals of school age, college age, and draft age
(1917-1918). The defects of school children rank in frequency as shown
in the left-hand column.


RELATIONSHIP OF TEACHER TO OTHER HEALTH AGENCIES


What should be the relationship of the classroom teacher to other
social agencies which are concerned with pupils’ physical health?
The medical and nursing professions are, of course, the best sources
for evaluation of these physical aspects. But it is the high quality
of these professions and the large amounts of individual and social
resources entering into the training of a practitioner of one of
them that prevent medical service from being available in the large
quantities necessary to satisfy all our needs for it. There cannot be a
physician in every classroom. The amount of trained medical help
available to classroom teachers varies from zero in one-room 
country school houses, through a single trained nurse for an entire 
township school system, through one or more nurses and physicians 
for an entire school system, through a resident physician for a 
single school, to the elaborate, almost hospital-like student health 
services of our large state universities. What the classroom teacher 
will evaluate concerning the physical aspects of her pupils must 
vary in each of these situations. Consequently it is impossible to 
give a fixed set of rules for all of them. The medical profession 
does not consider teachers to be “meddling” when they evaluate 
the physical aspects of their pupils. On the contrary, as is seen in 
the following quotation from Rogers (10 : 1), it is highly favorable 
to this sort of evaluation: 


The committee on legislation of the White House Conference on Child 
Health and Protection urged the training of teachers for the detection of 
signs of communicable disease and of gross physical defects. One of the 
forty-eight physicians of this group remarked that the ability of the 
teacher in this field “is the keystone of medical inspection.” There can be 
no substitute for such service, for the appearance of communicable dis- 
ease in a schoolroom does not await the coming of a physician or nurse, 
and no one is in such a position of vantage for observing any lapse of the 
child from a condition which seems, for him, normal. 


Rogers also draws the following conclusion from his discussion 
of physical-mental relations: 


The thoroughgoing training school for teachers will include as a fun- 
damental in its curriculum the close observation of physical traits of the 
instruments (i.e. pupils) with which they are to work. The material to 
be studied is always at hand in the pupils of the training school, and such 
a course of physical examination may well supplement the didactic work 
in physiology and hygiene which it will serve to bring home to the stu- 
dent in a practical way. Nor does it require a long and laborious schooling 
to prepare the teacher in such physical appraisement. If nice distinctions
were to be made in physical examination or decisions as to the treatments 
of diseases or defects, such would be the case, but these are not in the do- 
main of the teacher but are left to the physician or dentist. 

Teachers in service can be trained for this work by the school physician, 
who does not need to go searching for illustrations of physical defects. 




While her powers of observation will be sharpened more rapidly by 
good training, the teacher, with the help of such directions as are offered 
herewith, can do effective work. 


A rich source of ideas concerning the knowledge that teachers have 
of the health deficiencies of their pupils and concerning teacher 
participation in the health program is to be found in the monograph 
by Franzen which reports on the School Health Study of the 
American Child Health Association (4). Concerning the recognition
by teachers of the health needs of their pupils, Franzen concludes
(4 : 19):


1. Teachers have knowledge of only a very small proportion of the 
cases that are in definite need of professional attention. [See Table 2.] 

2. There is definite evidence that the teacher is able to perform a very 
useful function in initiation and reference. 

3. The teacher alone can combine selection, education and continuity 
of attention. She is best situated to refer cases for examination, use the 
material for habit structure, and maintain a continuing interest in the 
encouragement of corrections. 


TABLE 2.—Recognition by Teacher of Health Needs
(After 4 : 19)

Columns, left to right: defect as defined; number of children examined;
percentage of children truly having the defect (girls, boys); percentage
of children with true defect who were reported by the teacher, the
Efficiency Quotient (girls, boys); percentage of children who were
reported by the teacher as having defect and show the defect by
measurement, the Accuracy Quotient (girls, boys).

7366 16 1x 14 16 [21 59 
2510 14 14 Ir 14 55 64 
gitrelopment. -| 2338 18 17 | 16 xtd wi 45 
oU 4559 3 19 2 
SPE 281 ct DONT NG alin 
TRES 




4. The great need at present is for instruction and supervision of the 
teacher by coordinated action of health education supervisor, nurse and 
medical examiner. 


To this last conclusion we would add “teacher training institutions.”
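
The two quotients defined in Table 2 are simple proportions, and a short computation may make the distinction between them concrete. The following Python sketch is offered only as an illustration: the function name and the pupil counts are hypothetical and are not taken from the School Health Study. It assumes that the Efficiency Quotient is the percentage of children with a true defect whom the teacher reported, and that the Accuracy Quotient is the percentage of teacher-reported children who show the defect by measurement.

```python
def teacher_quotients(true_defect, reported):
    """Return (efficiency, accuracy) as percentages.

    true_defect: set of pupils who show the defect by measurement
    reported:    set of pupils the teacher reported as having the defect
    """
    confirmed = true_defect & reported  # reported AND truly defective
    efficiency = 100 * len(confirmed) / len(true_defect) if true_defect else 0.0
    accuracy = 100 * len(confirmed) / len(reported) if reported else 0.0
    return efficiency, accuracy


# Hypothetical class: pupils 1-10 truly have the defect.
true_defect = set(range(1, 11))
# The teacher reports four pupils; pupil 14 does not actually have the defect.
reported = {2, 5, 9, 14}

efficiency, accuracy = teacher_quotients(true_defect, reported)
print(efficiency, accuracy)
```

In this invented class the teacher finds only 3 of the 10 children who have the defect (efficiency 30 per cent), although 3 of her 4 reports are correct (accuracy 75 per cent). A teacher may thus be fairly accurate in the cases she does report and still miss most of the children needing attention.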

In the School Health Study, of which Franzen's monograph is 
the final report, there were four major phases: (1) construction of 
health tests and their application to seventy groups of school chil- 
dren, (2) removal of extraneous influences and allowance for eco- 
nomic status, (3) selection of distinctive aspects of procedure which 
are correlated with school health tests, and (4) grouping of selected 
items and interpretation so as to yield the elements of school 
health procedures which bring school health results. In the fourth 
phase of the study an attempt was made to distill generalizations 
out of meaningful groupings of the aspects of procedure selected in 
the third phase. In these generalizations we find valuable indica- 
tions of the relationships which a teacher's health work should have 
to the work of other school health agencies. The first of these 
principles, characteristic of school health programs that actually 
produce results, is sympathetic cooperation between nurse and 
teacher (4:58). This includes items giving evidence of: 


Discussion and agreement between nurse and teacher about health pro- 
cedure and policy. 

Discussion and agreement between teacher and oral hygienist about 
health procedure and policy. 

Active records which involve a mutual recognition of the health needs 
of children by teacher and nurse. 

Classroom information included in nurse's home visit preparation.

Teacher initiative relative to the health of her pupils and reference to
the nurse. 

Presence and participation of teacher in physical examination and of 
nurse in classroom inspections. 


The only aspect of nurse-teacher rapport that has a satisfactory 
frequency of favorable occurrence is initiative by the teacher and 
reference to the nurse. The other five are all judged by Franzen to
be in great need of improvement, at least in the presumably typical 
schools which participated in the School Health Study. 


The second generalization concerning the nature of the effective
school health program embraces those features dealing with physical 
examination by a trained medical doctor. When all the items of 
such an examination are separated into those which do and do not 
have a measurable school health effect, it is found that those which 
do have such an effect are those that are characteristic of an extended
examination. No two-minute line-ups contain these desirable
items of examination—desirable because they have a proved relation
to school health results as measured. 

Thus the thorough, adequate, effective physical examination in- 
cludes such items as individual hearing tests, palpation of cervical 
glands, accurate measurements of distances in hearing, and a dental
examination which is made by a dentist rather than a physician and which
makes a distinction between temporary and permanent teeth. These
items were found to occur in less than half of the survey units in- 
cluded in the School Health Study. 

In order to increase the possibilities for more frequent examinations
by physicians Franzen recommends that the relation between
teachers and physicians be changed so as to develop more fully the
teacher's potentialities for referring cases to physicians. Teachers 
can be trained so as to increase the validity of their suspicions con- 
cerning pupils’ teeth, vision, hearing, nutritional status, and skin 
conditions. Thus aided by teachers’ referrals, physicians and nurses 
will be able to give adequate attention to the pupils who are referred 
rather than having to spend much time in necessarily superficial 
inspections. 

In order further to enable physicians to make fewer but more intensive
physical examinations of pupils, it is recommended that
teachers fulfill their responsibility for initiating and recommending 
pupils for physical examination by keeping histories based on an 
intimate knowledge of individual children extending over a long 
range of time.

Thus the medical examination should be a subsequent reference
service. That is, measurement and “history” could better be supplied
by others, such as teacher or nurse, thus permitting the physician to
concentrate on his appropriate function, diagnosis.

Franzen's third generalization embraces items concerning teachers’ 
knowledge, opinions, and beliefs. Not technical knowledge but 




rather methods of teaching, manner of approach, knowledge of 
children, opinion of relative value of health practices, progressive 
attitudes toward methods of teaching, and acquaintance with 
authoritative health books and journals stand forth as characteristics 
of teachers which discriminate best between programs that do and 
do not increase school health. It is along the lines of this third 
generalization that the teacher can most independently go about 
making his contribution to school health. No nurse or physician is 
necessary for a teacher’s effort to attain the enlightenment which 
Franzen found so highly related to the physical and mental health 
of pupils. 

Needless to say, this third generalization, based on scientifically 
evaluated observational evidence, constitutes direct and potent sup- 
port for the inclusion in this book of the present chapter and of a 
subsequent chapter on how to evaluate the physical aspects of 
pupils. 


LEGAL CONSIDERATIONS IN THE EVALUATION OF PHYSICAL ASPECTS


Political and social recognition of the importance of pupil health 
has led to the inclusion in the school laws of 42 states of references 
to the evaluation of pupils’ physical status. The following para- 
graphs from the school laws of the State of Indiana (7) may serve 
as an illustration of a permissive law: 


225. Medical inspection of children 
1. That all school trustees and township trustees are herewith 
permitted and recommended to institute medical inspection of 
school children at any time, the said trustees may require 
teachers to annually test the sight and hearing of all school 
children under their charge, the said test and uses thereof to be 
made according to the rules hereinafter authorized.
226. Medical inspection defined 
2. The term, medical inspection, as used in this Act, shall be held 
to mean the testing of the sight and hearing of school children 
and the inspection of said children by school physicians for
disease, disability, decayed teeth, or other defects, which may
reduce efficiency or tend to prevent their receiving the full 
benefit of school work. 




These laws were written in 1911. Even at that time recognition
was given to the role of the teacher in evaluating the physical
aspects of pupils. The medical profession has frequently been
instrumental in bringing about the passage of such laws. Hoag and
Terman (6 : 64) report that in 1906 when the legislature of Massachusetts
was considering a mandatory provision by which vision
and hearing were to be tested by teachers, sittings were held during
which a mass of evidence as to the feasibility of the plan was
presented by some of the best-known (medical) specialists of the state.

The following quotation from Burks and Burks (2 : 156-157) is
also illustrative of the essential lack of novelty in what has been
said here concerning the teacher's role in evaluating pupils’ physical
status. In 1913 they wrote: “Minnesota is even training her teachers
to make general physical examinations, the claim being that trained
teachers can recognize 90 per cent of the defects of children. . . .
Swarthmore College, recognizing the direct responsibility of teachers
toward the health of children, has actually added a course of medical
inspection to its teacher's preparatory course. School children are
brought to the college every week and the students under the supervision
of the physical director are taught the methods of inspection
and examination.”

In concluding this discussion of what physical aspects of the pupil
should be evaluated, we refer the reader to Chapter XIII, which
presents descriptions of the methods of evaluating, recording, interpreting,
and taking action on the evaluation of the physical aspects
of pupils.


SUMMARY


Teachers should be concerned with the physical aspects of pupils
not because of any popularly assumed but undemonstrated relationships
between “body” and “mind” but because of the relationships
of physical aspects to school attendance and performance, vocations,
and emotional and social adjustment. The nature of the physical
aspects with which teachers should be concerned is discussed. The
teacher should accept the medical profession’s encouragement in
evaluating students’ physical aspects. Teachers have been found to
have little knowledge of their pupils’ physical defects. But they can
be trained to fulfill adequately their proper functions of
cooperating with school nurses, being alert to pupils’ physical needs,
referring pupils to physicians, keeping histories of pupils, and making
routine inspections and measurements. Evaluation of physical
aspects is recognized as a function of teachers in the school laws of
most states. This recognition is at least one generation old and has
frequently been supported by the medical profession, teacher training
centers, and authorities on pupil health.


QUESTIONS 


Discuss the values of teacher evaluation of physical aspects for the 
motivation and instruction of pupils. 

If you were teaching in a school that provided neither a school nurse 
nor a school physician, how would you undertake to compensate 
for this lack? 

Examine your own experience for instances when your teachers 
showed proper and improper regard for the physical condition of 
yourself or your fellow pupils. 

To what extent and for what reasons do you agree with the state- 
ment: “The physically handicapped pupil needs sympathy more than 
anything else"? 

Discuss the implications for teaching and teacher training of Fran- 
zen's generalization concerning the importance to pupil health of the 
teacher's knowledge, opinions, and beliefs. 


REFERENCES 


1. Brown, Marion, Leadership Among High School Students, Teachers
College Contributions to Education, No. 559, New York: Teachers
College, Bureau of Publications, Columbia University, 1933.

2. Burks, F. W. and J. D., Health and the School, New York: D.
Appleton and Company, 1913.

3. Cross, H. D., “Possibilities in dental hygiene,” American Journal
of Public Health, 16 : 234-236 (1926).

4. Franzen, Raymond, An Evaluation of School Health Procedures,
New York: American Child Health Association, 1933.

5. Hess, Frederick, “Health counseling service at Louisville,” Proceedings
of the Conference on Guidance and Student Personnel
Work, Evanston: Northwestern University, 1936.

6. Hoag, E. B., and Terman, L. M., Health Work in the Schools,
Boston: Houghton Mifflin Company, 1914.


7. McMurray, Floyd I., School Laws of the State of Indiana, Department
of Public Instruction, Indianapolis, Indiana, 1939. (The laws
were compiled under his direction.)

8. Paterson, D. G., Physique and Intellect, New York: D. Appleton-Century
Company, Inc., 1930.

9. Rogers, J. F., Physical Defects of School Children, U. S. Department
of the Interior, Office of Education, School Health Studies, No. 15,
Washington: Government Printing Office, 1925.

10. Rogers, J. F., What Every Teacher Should Know About the Physical
Condition of Her Pupils, Department of the Interior, Office of
Education, Pamphlet 68, Washington: Government Printing Office.

11. Williamson, E. G., How to Counsel Students, New York: McGraw-Hill
Book Company, Inc., 1939.


CHAPTER IV 


Mental Abilities 




WHAT DO WE MEAN BY “THE MENTAL ABILITIES OF PUPILS”? SCIENTIFIC 
psychology has not been able to find any real agreement within itself 
upon the answer to this question. That is, the nature of mental 
ability, although a very practical matter, is still largely in the realm 
of theory. This statement, however, is not as pessimistic as it may 
sound, for a similar statement can be made concerning basic 
physical concepts such as electricity or magnetism, despite our obvious
mastery of these concepts for practical purposes. The practical
importance of intelligence is so great that, in the fields of measure- 
ment and of educational policy based on pupil intelligence, practice 
has left theory far behind. Psychologists and teachers have been 
measuring intelligence for practical purposes for more than a genera- 
tion despite the controversy which still rages around the question of 
the nature and the determiners of mental ability. 

An acquaintance with the terms in which discussion concerning 
intelligence has been carried on will be helpful in understanding 
what is meant by the mental abilities of pupils. Let us therefore 
examine the concepts of intelligence under the following interrelated 
headings: 

1. General definitions of intelligence 

2. The organization of intelligence 

3. The relative influence of “environment” and “heredity” on 
intelligence 

4. Educational-vocational differences in mental ability 

5. Special mental abilities of pupils 


GENERAL DEFINITIONS


Definitions of intelligence have been offered throughout the history
of philosophy, but it is only with the meaning of the term in
the field of psychology and education during the past half century
that we are here concerned. During the period from 1880 to around
1900 the definition of intelligence was reflected in the attempts to
measure it through human sensori-motor processes. It was assumed
that sensory discrimination provided the key to the determination
of intelligence. The attempts proceeded on the following assumption,
as stated by Francis Galton (6 : 27): “The only information
that reaches us concerning outward events appears to pass through
the avenues of our senses; and the more perceptive the senses are
of differences, the larger is the field upon which our judgment
and intelligence can act.” However, attempts to measure intelligence
based on this assumption proved unsuccessful.
Toward the end of this period another point of view, of which
Alfred Binet was the leader, focused attention on a concept of intelligence
which emphasized the more complex or highly organized
mental activities, such as memory, association, judgment, and attention.
In 1905 Binet wrote: “To judge well, to comprehend well,
to reason well, these are the essential activities of intelligence”
(191-244). Binet's approach has sometimes been referred to as
the “[…]” approach. The ultimate purpose of intelligence according
to Binet was the continuous adjustment of the individual to his
environment which is accomplished as the result of an organization
in which several mental functions (comprehension, invention, direction,
and criticism) are involved.
Charles Spearman (9) has put forth his definition in terms of a
statement of a series of “qualitative” and “quantitative” principles
(referred to by him as the noegenetic laws). The expression of
intelligence according to him consists of the following three “laws”
in all their possible combinations: (1) The apprehension of one's
own experience, or the degree of a person's possession of the power
to observe what goes on in his own mind. A person cannot only
feel, he must also know what he feels; he cannot only strive, he
must also know that he strives; he cannot only know, he must also
know that he knows. This is illustrated when a person says, “I like
this” or “I am thinking of the past.”

(2) The eduction of relations, or the degree of a person's power
to bring to mind any relations that essentially hold between two or
more ideas in his mind. This law is illustrated whenever a person
makes such statements as “Seven is more than five” or “Sugar is
sweeter than salt.”

(3) The eduction of correlates, or the degree of a person's power 
to bring into mind the correlative idea (when he has in mind any 
idea together with a relation). For example, when an individual has 
in mind the number five and the relationship "two times as great 
as," he may be able to educe the correlative idea of the number ten. 

In 1921 the editors of the Journal of Educational Psychology published
a symposium (11) on intelligence and its measurement, which
produced the following definitions: 

Thorndike, “We may define intellect, in general, as the power of 
good responses from the point of view of truth or fact." 

Terman, "Intelligence is the ability to think in terms of abstract 
ideas.” 

Pintner, “Intelligence [is] the ability of the individual to adapt 
himself adequately to relatively new relations in life” implying “ease 
in breaking old habits and in forming new ones.” 

Henmon, “Intelligence is capacity for knowledge and knowledge 
possessed.” 

Thurstone, “(a) inhibited capacity; (b) analytical capacity; and 
(c) perseverance.” 

Woodrow, “Intelligence is the capacity to acquire capacity.” 

Dearborn, “Intelligence is the capacity to learn or profit by ex- 
perience.” 

Haggerty, “Intelligence is a practical concept . . . connoting a 
group of complex mental processes traditionally defined in systematic 
psychology as sensation, perception, association, memory, imagina- 
tion, discrimination, judgment, and reasoning . . . while emotions, 
instincts, will activities, and so-called character traits are for the 
most part excluded.”

In 1927 Thorndike conceived of intelligence as including “opera- 
tions such as we may call attention, retention, recall, recognition, 
selective and rational thinking, abstraction, generalization, organiza- 
tion, inductive and deductive reasoning, together with knowledge 
and learning in general" (14 :22). He also distinguished between 
four different aspects of intelligence which were named as follows: 
(1) altitude or level, which is apparently native and determines the 
limits of response with respect to difficulty; (2) range, the number 


of tasks at any given degree of difficulty that we can perform;
(3) area, or totality of response, contributed to by both altitude and
range; and (4) speed of response.
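
The first three of these aspects can be given a rough arithmetical form. The Python sketch below is a hedged illustration only, not Thorndike's own scoring procedure: the record of passes and failures is invented, the function names are hypothetical, and the rule that a level counts as mastered when at least half of its tasks are passed is an assumption made here for the sake of the example.

```python
# Hypothetical record for one pupil: difficulty level -> pass/fail per task.
results = {
    1: [True, True, True, True],
    2: [True, True, True, False],
    3: [True, False, True, False],
    4: [False, False, True, False],
    5: [False, False, False, False],
}


def range_at(results, level):
    """Range: number of tasks passed at one given level of difficulty."""
    return sum(results[level])


def altitude(results, criterion=0.5):
    """Altitude: highest level at which the share of tasks passed
    reaches the criterion (assumed here to be one half)."""
    mastered = [
        level
        for level, tasks in results.items()
        if sum(tasks) / len(tasks) >= criterion
    ]
    return max(mastered) if mastered else None


def area(results):
    """Area: total number of tasks passed over all levels."""
    return sum(range_at(results, level) for level in results)


print(altitude(results), range_at(results, 2), area(results))
```

For this invented record the pupil's altitude is level 3, his range at level 2 is three tasks, and his area is ten tasks in all. Area therefore grows either by succeeding at harder levels or by passing more tasks at each level, which is the sense in which it is contributed to by both altitude and range.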
The Gestalt school of psychology, in accordance with the point
of view which it has developed during the past thirty years, conceives
of intelligence as an integrated action of the organism and gives no
separate treatment to intelligence in itself. An intelligent act is one
in which the elements of a situation are united to form a significant
configuration. Intelligence is the result of a total functioning of the
organism in its environment. Thus R. H. Wheeler defines intelligence
as the “perception of one detail in a situation in a more complicated
relationship to the total situation, namely, the relationship
of means to an end” (17 : 128).
From this listing of definitions of intelligence the reader may
emerge with a realization of the unsatisfactoriness of abstract definitions.
It is this same feeling of the inadequacy of words to define
anything so basic and complex as intelligence that has caused many
psychologists to resort to the statement that intelligence is whatever
the intelligence tests measure. Obviously, however, this is
merely begging the question. It is impossible to escape the necessity
of formulating some concept of what the test measures. But the
full meaning of any concept of intelligence for practical purposes
emerges only when we do something about it, as when we perform
operations to measure intelligence. As is usual in science, it is here
a case of interaction between the conceptual end and the practical
means. The end of intelligence tests is formulated in terms of the
definition of intelligence, but we gain full understanding of this
definition only in terms of the means, or intelligence tests, themselves.
Consequently, the definitions of intelligence formulated by
the foremost workers in the field of intelligence measurement can
give only a preliminary understanding of this aspect of pupils. A
fuller understanding will come when we take up the question, How
to measure intelligence, in Chapters XIV and XV.


THE ORGANIZATION OF INTELLIGENCE


The main theories concerning the organization of intelligence may
be described in terms of degrees of specificity. Thorndike has humorously
designated the three major theories as the “sand” theory,
the “gravel” theory, and the “cobblestone” theory. In accordance
with this classification, Thorndike’s hypothesis concerning the or- 
ganization of intellect is a “sand” theory (14). Defining intellect as 
the ability to succeed with certain tests and measuring it by taking 
a fair sampling of the tests that require intellectual power, he notes 
the customary sharp distinction between two levels of intellect: one 
involving mere connection-forming, the association of ideas, the 
acquisition of information, and specialized habits of thinking; and 
the other, the second or higher level, which is characterized by ab- 
straction, generalization, the perception and use of relations, the 
selection and control of habits in inference or reasoning, and ability 
to manage novel or original tasks. Only surface intelligence can be 
divided into these two levels, however. The deeper nature of intellect 
requires a different formulation. Thorndike’s hypothesis holds that 
quality of intellect depends upon quantity of connections. Connec- 
tions are the physiological mechanisms whereby a nerve stimulus is 
conducted to and excites action in specific nerve cells, muscles, and 
glands. Differences in the number of connections or bonds of the 
associative type are sufficient to account for both quantitative and 
qualitative differences in degree of intellect. No special qualitative 
differences in mental organization are required. 

The “gravel” theory of intelligence posits a relatively small num- 
ber of distinct mental abilities which are thought of as being rela- 
tively independent of each other, rather than the large number of 
ungrouped connections conceived by Thorndike. These distinct 
mental abilities, few in number, were called “faculties” earlier in 
the history of psychology. As this theory has developed out of the 
statistical work of such men as T. L. Kelley and L. L. Thurstone, 
they are called primary abilities, or functional unities. Using the 
method of factorial analysis of the intercorrelations among many 
tests, Thurstone (16) has arrived at the following list of primary 
mental abilities: number facilities; word fluencies; visualizing; 
memory of words, names, and numbers; perceptual speed; induc- 
tion; and verbal reasoning. That is, Thorndike’s large number of 
mental abilities or “atoms” or connections are grouped by these 
theories into a much smaller number of primary abilities. 

The third theory of the organization of mental ability, known as 
the “cobblestone” theory, is that all abilities may be classified into 




one of two types. This two-factor theory, first formulated by Spear- 
man, is also based on the statistical examination of correlations be- 
tween various tests. The first factor has been called the “general” 
factor and is denoted by the letter “g.” It is so named because, al- 
though varying in amount from individual to individual, it remains 
the same for any one individual in respect of all the correlated 
abilities. The second kind of mental ability is called the “specific”
factor and is denoted by the letter "s"; it varies not only from in- 
dividual to individual, but even for any one individual from one 
ability to another. When these specific factors overlap they are 
called group factors. But, in general, any mental task involves a 
certain amount of general ability, g, which is common to all other 
mental tasks, and a certain amount of specific ability, s, which is 
involved in no other mental tasks. 
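
The statistical content of the two-factor theory can be sketched in a few lines. The Python example below is an illustration under stated assumptions, not a reproduction of Spearman's own computations: the four tests and their g-loadings are hypothetical, scores are assumed to be standardized, and g is assumed to be uncorrelated with every specific factor. Under those assumptions the correlation between any two tests is the product of their g-loadings, and the so-called tetrad differences among the correlations vanish, a well-known consequence of the model that the text above does not spell out.

```python
# Hypothetical g-loadings for four tests (illustrative values only).
loadings = {"vocabulary": 0.8, "arithmetic": 0.7, "analogies": 0.9, "memory": 0.5}


def implied_correlations(loadings):
    """Correlation between two different tests under the two-factor model:
    the product of their g-loadings."""
    return {
        a: {b: loadings[a] * loadings[b] for b in loadings if b != a}
        for a in loadings
    }


def tetrad_difference(r, i, j, k, l):
    """r_ik * r_jl - r_il * r_jk, which is zero when a single
    common factor accounts for all the correlations."""
    return r[i][k] * r[j][l] - r[i][l] * r[j][k]


def specific_share(loading):
    """Share of a standardized test's variance left to its specific factor s."""
    return 1 - loading ** 2


r = implied_correlations(loadings)
td = tetrad_difference(r, "vocabulary", "arithmetic", "analogies", "memory")

print(round(r["vocabulary"]["analogies"], 2))  # correlation implied by g alone
print(abs(td) < 1e-12)                         # tetrad difference vanishes
print(round(specific_share(loadings["memory"]), 2))
```

The design choice here is to work from assumed loadings to implied correlations rather than from real test data, so that the relation between g, s, and the observed correlations can be seen without any sampling error.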

Anastasi (1 : 304) and Thomson (13) have pointed out that Thorn- 
dike and Spearman are essentially in agreement, both holding that 
a common factor is sufficient to explain differences in levels of intelligence.
The difference between them is in their conception of
the common factor. Spearman calls it “general mental energy” while 
Thorndike calls it “quantity of mental connections or bonds.” But 
to Spearman’s g as an entity such as “mental energy” there is the 
objection that it is unnecessary and meaningless, in that no physio- 
logical correlate of such an idea is available. To bonds there is no 
such objection. They already exist in the form of the physiologist's 
concept of synapses, and since they can explain the quantitative 
phenomena they may better be accepted as the explanation until 
they fail to account for some new finding. 


THE RELATIVE INFLUENCE OF “ENVIRONMENT” AND “HEREDITY”
ON INTELLIGENCE


We may ask, why do people differ in intelligence? The student 
will have noticed that in this chapter we have persistently put quota- 
tion marks around the words “environment” and “heredity.” This 
was done to indicate that these two concepts are abstractions that
in human beings have not been found separable. They are more 
usefully thought of as representing the extremes of a continuum on 
which the factors influencing human development may be placed. 
At one extreme are such traits as eye color and height; at the other
such characteristics as the language one speaks and one's taste in 
clothes. The fact of the matter is that to ask the question, which 
is more important, environment or heredity, in one sense poses a 
question impossible to answer. The two always interact to produce 
all the characteristics of a given individual. They can in reality be 
separated no more than a river that has been formed by the junction 
of two tributaries can be separated into the two original streams. 
They are abstractions of things which in biological and social reality 
have merged into one. 

The first successful intelligence tester, Alfred Binet, believed 
that intelligence was something which could be substantially raised 
or lowered by the influences of training, as the following quotation 
shows (2:118): “With practice, ambition, and especially method 
one can succeed in increasing his attention, memory, and judgment, 
and in literally becoming more intelligent than before.” 

On the other hand Terman, the foremost exponent of the Binet 
method in the English language, placed the emphasis on heredity
as the determiner of intelligence; the following quotation indicates 
his point of view: “It would, of course, be going too far to deny 
all possibility of environmental conditions affecting the result of an 
intelligence test. Certainly no one would expect that a child reared 
in a cage and denied all intercourse with other human beings could 
by any system of mental measurement test up to the level of normal 
children. There is, however, no reason to believe that ordinary dif- 
ferences in social environment (apart from heredity), differences 
such as those obtaining among unselected children attending approximately
the same general type of school in a civilized community,
upset to any great extent the validity of the scale” (12 : 116).

This disagreement among psychologists concerning the relative 
influence of nature and nurture on intelligence has led to many ex- 
periments designed to give a more definite answer to the question. 
One group of such experiments has shown that very low intelligence, 
or feeble-mindedness, tends to run in families.

Another group of studies has examined the development of in- 
telligence in very untoward circumstances, and intelligence thus de- 
termined has been compared and contrasted with intelligence under 
more favorable or more normal environmental conditions; studies 
of children growing up in isolated mountain communities, of canal
boat children in Europe, and of the effect of rural isolation all come 
under this grouping. 

A third group of studies has examined intelligence in presumably 
superior environmental circumstances and compared and contrasted 
it with the relatively less favorable or with the normal situation; 
studies of foster children of known parentage and intelligence come 
under this heading. 

A fourth group uses favorable environmental conditions, such as 
carefully planned pre-school laboratories and nursery schools, with 
measurement of intelligence before, during, and after the operation 
of the favorable educational influences in these situations. 

A fifth group begins with the “same” environment and later 
differentiates the environment; studies of identical twins separated 
in early childhood and brought up in different environments are in- 
cluded here. 

Many earlier correlational studies showed that degree of intel- 
ligence tends to be the same within families. Brothers and sisters, 
parents and children were all as much alike in intelligence as they 
were in physical characteristics such as height or weight. Identical 
twins, who are as closely related genetically as it is possible for two 
human beings to be, were most similar of all in intelligence. 

Despite many elaborate experiments and much heated controversy 
on these questions, no great lessening of the differences in emphasis 
between the “environmentalists” and the “hereditarians” has resulted. 
Instead it is coming to be increasingly recognized that the question 
of the proportional effect of heredity and environment in deter- 
mining the course of mental growth is unanswerable because it is 
in fact impossible to separate environment and heredity. That is, 
conclusions are no longer being stated in such terms as were used
in the following quotation from the Twenty-Seventh Yearbook of 
the National Society for the Study of Education (5 : 309): 

“The maximum contribution of the best home environment to
intelligence is apparently about 20 I.Q. points, or less, and almost
surely lies between 10 and 30 points. Conversely, the least cultured,
the least stimulating kind of American home environment may decrease
the I.Q. as much as 20 I.Q. points. But situations as extreme
as either of these probably occur only once or twice in a thousand
times in American communities.”




In 1940 the National Society for the Study of Education published 
another yearbook (8) in two volumes on the subject of the influence 
of nature and nurture on intelligence. The interested reader is re- 
ferred to these volumes for the most comprehensive treatment of the 
subject available. The issue is attacked from all points of view by the 
leading proponents of “hereditarianism” and “environmentalism” 
but without any appreciable reduction of the differences between 
them. 

Instead of statements concerning proportionate influences, the 
issue is now being phrased as follows: What methods, if any, are 
known at present by which the course of mental growth can be 
modified? That is, can modern nursery schools or superior foster 
homes improve the mental abilities of children? Or will the mental 
ability of children of good ancestries but reared in a very inferior
environment be adversely affected? At the State University of Iowa,
the conclusions from studies designed to answer these questions 
have been decidedly in the affirmative. These conclusions have been 
seriously challenged, however, on statistical, logical, and administra- 
tive grounds. We cannot here be concerned with the details of this 
argument. Let us rather turn to a discussion of the outcomes of these 
controversies which are of practical value to the classroom teacher.

1. The concept of intelligence or mental ability is culturally de- 
termined. That which is regarded as intelligence varies with time 
and place. In our culture, spoken and written vocabularies are im- 
portant. In other cultures, motor ability and skill might perhaps be 
more useful in distinguishing the able and the less able. Similarly, 
as it is now measured, intelligence has come to mean, in large part, 
aptitude for success in school.

2. Neither extreme “environmentalism” nor extreme “hereditari- 
anism” is as correct as the middle point of view. Intelligence tests 
are neither instruments for measuring pure “native capacity” nor 
merely improved school achievement tests; whatever they measure is 
influenced to a marked degree by training and this influence should 
not be ignored in interpreting intelligence test scores.

3. Concerning the organization of intelligence, there seems to be 
a valid basis for believing that: (a) General intellectual ability probably
is a reality (in the sense of Spearman's g), embracing many
ways of thinking but with emphasis on the more general, abstract,
and complex ones. (b) Group factors or primary abilities may also 
be distinguished; rather than being fixed, however, they are prob- 
ably forms of behavior acquired in response to organized sets of 
environmental social stimuli.


EDUCATIONAL-VOCATIONAL DIFFERENCES IN MENTAL ABILITY


Although the proper use of evaluations of pupils’ mental ability
for educational and vocational guidance belongs within the scope of
another book on guidance, it will not be inappropriate here to present
findings which are of importance to the interpretation of data
for such purposes. In the first place, there are significant differences
in the intelligence of students in different high school curricula.
The National Survey of Secondary Education found that in general
the highest intelligence is found in the academic curriculum; the
next highest, in the commercial courses; and the lowest average
intelligence in the industrial arts or vocational curricula (7).
Whether or not these differences are due to an unfair loading of
the tests with such tasks as favor the abilities and interests of a
particular type of student is a matter of indifference, since these
significant differences are meaningful for guidance purposes regardless
of value judgments concerning them. Furthermore, there are
significant differences in the intelligence of students in different
subjects. In foreign languages, high school and junior college students
had average I.Q.’s of 105 in Spanish, 108 in Latin, 111 in
French, and 112 in German. Similarly, the I.Q. is of value in estimating
a high school student's probable success in English, mathematics,
science, and average academic achievements. It has least
value in estimating success in art, music, and home economics. In
the third place, there seem to be definite intelligence requirements
for the practice of certain professions in so far as very few persons
with intelligence test scores below a certain minimum have been
found who have achieved success in these occupations. Tables showing
this relationship between intelligence and a great number of
occupations are presented in Bingham's Aptitudes and Aptitude
Testing (4 : 44-59) and Strang's Behavior and Background of Students
in Colleges and Secondary Schools (10 : 122-128).




SPECIAL MENTAL ABILITIES OF PUPILS


Our discussion of theories of mental organization mentioned those 
theories which group intelligence according to a relatively small 
number of primary abilities or faculties on the basis of factorial 
analyses of performances on mental tests. Thurstone, for example, 
now believes that it is possible to describe each individual in terms 
of at least seven indices which should replace the intelligence quo- 
tient, mental age, and other gross scores of general intelligence. 
That is, each individual should be described in terms of a profile 
of mental ability instead of by a single index of intelligence. Such 
a mental profile, composed of the scores in the primary mental 
abilities, is considered by Thurstone to be more helpful in educa- 
tional and vocational counseling than the single composite intel- 
ligence rating that has been in common use for many years (15). 

Whether the breakdown of general mental ability is in such 
terms as Thurstone proposes or in terms of the more traditional
"special aptitudes" is a question of practical convenience. In using
Thurstone's approach it would be necessary to determine the degree
to which various occupations require each of the primary abilities, 
then to determine the degree to which an individual possesses each 
of the primary abilities, and finally to determine the individual's 
fitness for various occupations in terms of the agreement between 
his profile and the occupational profiles. 
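The three steps of this approach can be sketched in code. The sketch below is a hedged illustration only: the text speaks merely of "agreement" between profiles, so the use of Euclidean distance as the measure of agreement is an assumption, and the ability labels, the occupations, and every score are hypothetical values invented for the example.

```python
import math

# Hypothetical labels for seven primary abilities (assumed for illustration).
ABILITIES = [
    "verbal", "word_fluency", "number", "space",
    "memory", "perceptual_speed", "reasoning",
]


def profile_distance(person, occupation):
    """Euclidean distance between two profiles; smaller means closer agreement."""
    return math.sqrt(sum((person[a] - occupation[a]) ** 2 for a in ABILITIES))


def rank_occupations(person, occupations):
    """Return occupation names ordered from best to poorest agreement."""
    return sorted(
        occupations,
        key=lambda name: profile_distance(person, occupations[name]),
    )


# Step 1: degree to which each occupation requires each ability (invented).
occupations = {
    "clerk": {"verbal": 55, "word_fluency": 55, "number": 60, "space": 45,
              "memory": 55, "perceptual_speed": 70, "reasoning": 50},
    "draftsman": {"verbal": 50, "word_fluency": 45, "number": 60, "space": 70,
                  "memory": 50, "perceptual_speed": 60, "reasoning": 60},
}

# Step 2: degree to which the individual possesses each ability (invented).
pupil = {"verbal": 50, "word_fluency": 45, "number": 60, "space": 70,
         "memory": 50, "perceptual_speed": 55, "reasoning": 65}

# Step 3: fitness judged by agreement between the two kinds of profile.
ranking = rank_occupations(pupil, occupations)
print(ranking)
```

Other measures of agreement, such as the correlation between profiles, would serve equally well in such a sketch; the choice affects the ranking only when profiles differ in general level as well as in shape.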

The more traditional approach makes the breakdown first in 
terms of fields of endeavor such as music, art, engineering, and then 
constructs measuring instruments specifically for the prediction of 
success in each of these fields. That is, the construction of most tests 
designed to help in estimating the probabilities that a person will 
be able to follow successfully an occupation he is considering has 
attempted to break down mental ability in terms of culturally determined
fields of endeavor, e.g., musical aptitudes, art aptitudes,
verbal aptitudes, mathematical aptitudes, legal aptitudes, nursing
aptitudes, engineering aptitudes, teaching aptitudes, clerical apti- 
tudes, mechanical aptitudes, and the like. 

Let us now examine what is meant by the term aptitude, or 
special mental ability. Warren's Dictionary of Psychology defines 
aptitude as a “condition or set of characteristics regarded as symptomatic
of an individual’s ability to acquire with training some
(usually specified) knowledge, skill, or set of responses such as the 
ability to speak a language, to produce music, etc.” More briefly, 
aptitudes are present traits considered as predictors of future achieve- 
ment. In using the term aptitudes we do not necessarily imply 
specifically inherited aptitudes. A good achievement test in algebra 
is a predictor of future achievement in engineering (i.e., measures
one aspect of engineering aptitude) just as much as pitch discrimina- 
tion is one aspect of musical ability. Yet achievement in algebra is 
to a great extent the result of training, whereas ability to discriminate
pitch has been found to be relatively little subject to training. 

Aptitudes may be either acquired or inherited. “Present traits” of 
the individual include not only his ability but also his interests, ad- 
justments, character, physical aspects, and indeed all his other 
aspects; but in evaluating his aptitudes we usually concentrate on 
the one, two, three, or more aspects which are most valuable for 
predicting success. 

Aptitudes or special mental abilities are similar to other aspects of 
individuals in that they are not all equally strong within the same 
person, and in that people differ from one another with respect to 
their possession of any single aptitude. Furthermore, like general 
mental ability, aptitudes may be considered to be relatively stable. 
It is often far more economical to fit an individual’s field of en- 
deavor, either educational or vocational, to his present set of ap- 
titudes than to attempt to change his aptitudes through some kind 
of training so as to enable success in some field of endeavor. 

This does not mean that in evaluating aptitudes we must search
for aspects of the pupil which have not been influenced by experi- 
ence and training. That is, even aspects that have been subjected 
to much training—in other words, the present achievements of 
pupils—should be evaluated in so far as these achievements may be 
considered predictive of future achievement. What this means is 
that not merely tests that are labeled “aptitude” may rightfully be 
used for the measurement of aptitudes. This consideration, although 
it is mainly relevant to our subsequent treatment of the question, 
how to evaluate special ability or aptitudes, is important here for a 
valid understanding of what is meant by these terms. 

The real distinction between achievements and aptitudes inheres
not in what is measured but rather in the purpose for which it is 
measured—whether the point of view is backward-looking or for- 
ward-looking, whether the concern is with the pupil’s past or with 
his future, at any particular moment in our thinking about him. If 
we may consider a pupil’s capacity to profit from further training 
as represented by the ratio between his present achievements and 
the amount of training he has received to date, then his aptitude or 
capacity may be evaluated whenever it is possible to specify the 
amount of training, and whenever it is possible to evaluate present 
achievement. 
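The ratio suggested here can be written out as a brief computation. This is a minimal sketch under assumptions of our own: the function name `capacity_index`, the units (test-score points and years of instruction), and the sample figures are all hypothetical, and the text does not prescribe any particular scale for either quantity.

```python
def capacity_index(achievement_score, training_amount):
    """Ratio of present achievement to the amount of training received.

    Both arguments are assumed to be expressed in consistent units across
    the pupils being compared; the ratio has no meaning otherwise.
    """
    if training_amount <= 0:
        raise ValueError("training_amount must be positive")
    return achievement_score / training_amount


# Two hypothetical pupils with the same present achievement in algebra
# but different amounts of instruction to date.
pupil_a = capacity_index(achievement_score=60, training_amount=2)  # 30.0
pupil_b = capacity_index(achievement_score=60, training_amount=3)  # 20.0
print(pupil_a, pupil_b)
```

On this reading the pupil who reached the same achievement with less training shows the larger index, which is the forward-looking sense in which an achievement measure may serve as an aptitude measure.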

The available opportunities, either in school or in the world of 
work, are a vital phase of our concern with special abilities or aptitudes.
An acquaintance with these various opportunities, with the
abilities and training they demand, with the earnings and prestige
they return, and with the changes of trend in the opportunities to
enter these fields is second in importance only to acquaintance
with the pupil himself. Concerning opportunities in school, the
sources of information are well known to most teachers. Similar in- 
formation concerning the world of work is made available by many 
books and services, of which the following are representative: 

1. Bingham, W. V., Aptitudes and Aptitude Testing, Part 2: 
“Orientation Within the World of Work.” Contains valuable dis- 
cussions of manual occupations, skilled trades, clerical occupations, 
and the professions such as engineering, law, medicine, dentistry, 
nursing, teaching, music, and art. 

2. Publications of the Occupational Information and Guidance 
Service, U. S. Office of Education, Washington, D.C. Among the 
recent publications of value are: 

a. Guidance Bibliography: Occupations. W. J. Greenleaf. Reissued
1940. A brief list of references annotated and indexed.

b. Eighty New Books on Occupations. W. J. Greenleaf. 1940.

c. Trends in Occupations and Vocations. W. J. Greenleaf. 1940.

3. Dictionary of Occupational Titles, U. S. Employment Service, 
Department of Labor, Washington, D.C., 1939. (For sale by the 
Superintendent of Documents, Washington, D.C.) Part I: Definitions
of Titles, 1287 pages, $2.00. Part I includes job definitions prepared
by the U. S. Employment Service for use in employment
offices and vocational services; 17,452 separate jobs are defined and
12,292 alternate titles are mentioned. It incorporates a significant occupational
classification and code, and the glossary lists alphabetically
definitions of technical terms and names of machines and equipment. 
Common commodities sold in retail and wholesale trade are listed. 
Industries and jobs in each trade are listed. 

4. Publications of the Science Research Associates, 1700 Prairie 
Avenue, Chicago, Illinois. Among the services offered for students 
are the monthly career publication, Vocational Trends; the monthly 
vocational monograph, American Job Series; the monthly reading 
index, Vocational Guide; and the monthly Reprints and Abstracts 
of important vocational articles. 

No special reason exists for our concern with occupational in- 
formation at this point except that the specific definition of special 
abilities or aptitudes must eventually refer itself to such informa- 
tion. Traditionally, and perhaps justifiably, special mental abilities 
have occupied the focus of attention in discussions of vocational 
guidance, although, of course, it is only in the light of evidence 
concerning every aspect of the whole pupil that such guidance can
be valid. 


SUMMARY 


The definition of intelligence is not yet completely agreed upon 
but centers around higher mental processes rather than sensory dis- 
criminations. The verbal formulations of Binet, Spearman, Thorn- 
dike, and others are mentioned, and reference is made to measuring 
instruments as the most concrete manifestation of the definitions.
The organization of intelligence in terms of various degrees of 
specificity or generality is considered; the current theories reject
thoroughgoing specificity and agree that the truth lies somewhere 
between complete generality and primary, or group, factors. The 
nature-nurture controversy is briefly examined; the practical implication
is that intelligence tests should be interpreted in the light
of the individual’s background, environment, and training. Educational-vocational
differences in intelligence and their implications
for guidance are pointed out as an addition to any practical under- 
standing of the concept of intelligence. Special mental abilities are 
considered from the standpoint of both primary factors and educational
or vocational “aptitudes” and are considered in the general
sense of any aspect of pupils useful in predicting success of various 
kinds. The importance of vocational information in any realistic 
analysis of aptitudes is pointed out and sources of such information 
are mentioned. 


QUESTIONS 


1. Draw up an original definition of intelligence and defend it against
possible criticism.

2. To what extent does your own observation of yourself and others sup- 
port the group factor theory of the organization of mental ability? 
Do you know of persons with markedly greater ability in some re- 
spects than others? 

3. If abilities do go in groups to some extent, do you think the grouping
is according to materials (such as words, numbers, and drawings) 
or according to mental processes (such as memory, induction, and 
deduction)? 

4. Discuss pro and con the following assertion: A school teacher should 
keep abreast of the controversy concerning the relative influences of 
"heredity" and "environment" on intelligence. 

5. How do educational and vocational differences in intelligence test 
scores help to define what is meant by intelligence? 

6. In what sense may all the aspects of pupils considered in this book be
considered aptitudes? How is the term "aptitudes" usually restricted
when used in naming tests?


REFERENCES


1. Anastasi, Anne, Differential Psychology, New York: The Macmillan
Company, 1937.

2. Binet, Alfred, Les Idées modernes sur les enfants, 1909, p. 118, as
quoted in Peterson, J., Early Conceptions and Tests of Intelligence,
Yonkers: World Book Company, 1935.

3. Binet, Alfred, and Simon, T., “Méthodes nouvelles pour le diagnostic
du niveau intellectuel des anormaux,” L'Année psychologique,
11 : 191-244 (1905).

4. Bingham, W. V., Aptitudes and Aptitude Testing, New York:
Harper & Brothers, 1937.

5. Burks, B. S., in “Nature and Nurture, Their Influence upon Intelligence”
(Part I), Twenty-seventh Yearbook, National Society for
the Study of Education, Bloomington, Illinois: Public School Publishing
Company, 1928.

6. Galton, Francis, Inquiries into Human Faculty and Its Development,
London: Macmillan & Co., Ltd., 1883.

7. Kefauver, G. N., and others, The Horizontal Organization of
Secondary Education, National Survey of Secondary Education,
Monograph No. 2, 1932.

8. National Society for the Study of Education, “Intelligence: Its
Nature and Nurture; Part I, Comparative and Critical Exposition;
Part II, Original Studies and Experiments,” Thirty-ninth Yearbook,
National Society for the Study of Education, Bloomington,
Illinois: Public School Publishing Company, 1940.

9. Spearman, Charles, The Abilities of Man, New York: The Macmillan
Company, 1927.

10. Strang, Ruth, Behavior and Background of Students in Colleges
and Secondary Schools, New York: Harper & Brothers, 1937.

11. Symposium, “Intelligence and Its Measurement,” Journal of Educational
Psychology, 12 : 123-147, 195-216, 271-275 (1921).

12. Terman, L. M., The Measurement of Intelligence, Boston: Houghton
Mifflin Company, 1916.

13. Thomson, G. H., “The nature and measurement of the intellect,”
Teachers College Record, 41 : 726-750, 1940.

14. Thorndike, E. L., and others, The Measurement of Intelligence,
New York: Teachers College, Columbia University, 1927.

15. Thurstone, L. L., “A new concept of intelligence and a new method
of measuring primary abilities,” Educational Record, Supplement
No. 10, pp. 133-134, 1936.

16. Thurstone, L. L., Primary Mental Abilities, Chicago: University of
Chicago Press, 1938.

17. Wheeler, R. H., The Science of Psychology, New York: The Thomas
Y. Crowell Company, 1929.


CHAPTER V 


Emotional and Social Adjustment 


THE TWO MAJOR IMPRESSIONS USUALLY OBTAINED BY TEACHERS OR 
psychologists when they come to consider the field of emotional and 
social adjustment are that the field is both immensely important and 
immensely confused. The importance of feeling well, of not being 
continually stirred up, and of getting along well with people is 
second only to the importance of living itself. And the number of 
possible points of view, of ways in which the field of emotional and 
social adjustment may be analyzed, seems to have been limited only 
by the number of individuals who have attempted such analyses. 
But our consideration of these aspects of pupils will be guided and, 
we hope, given significance by our purpose to meet the teacher's 
practical needs for evaluation data for guidance purposes. Controlled 
by this purpose, let us consider the nature of adjustment and malad- 
justment. 


THE NATURE OF ADJUSTMENT


Adjustment may be defined very generally as the process whereby 
a living organism varies its activities in response to changed condi- 
tions in its environment. An organism’s needs can be fulfilled only 
by behavior that is effectively adapted to its opportunities. When 
external circumstances change, the organism must modify its be- 
havior and discover new ways of satisfying its wants. The three 
fundamental ways in which the adjustment process can take place 
are new forms of response, changing the environment (this form 
of adjustment has been called mastery), and the modification of 
the organic needs themselves. But conceived thus generally, adjustment
may be identified with living itself. Thus, according to

Shaffer, “As long as an animal continues to adjust and to modify 
its responses it continues to live. If it fails to adjust in some degree, 
its existence is imperiled. When an animal ceases entirely to adjust, 
it is dead” (7 : 3).

The common understanding of the term “adjustment,” however, 
includes only the more fundamental, continuous, and pervasive 
levels of activity and not such narrow functions as the eye's reaction 
to light, the finger’s reaction to pain, or the “mind’s” reaction before 
such a problem as 2 + 2 = ?. That is, adjustment as it is generally
understood means the satisfaction of certain drives, needs, basic 
motives, urges, desires, or tendencies, which involve the whole or- 
ganism. 

Many lists of such “urges” have been made. Among these have 
been self-preservation, race preservation, inquisitiveness, combativeness,
fear, gregariousness, sociability, maternal love, sex, constructiveness,
sympathy, rivalry, secretiveness, feeding, curiosity, self-assertion,
questioning, imitation, jealousy, repulsion, submissiveness,
shyness, modesty, playing, walking, friendliness, cooperation, and so 
on. It is impossible to choose the essential few from among these 
various lists, on the basis of scientific validity. The inadequacy of
words to symbolize dispositions and personality constellations, or
determining tendencies, is well indicated by the abundance of such
words. Allport and Odbert (2) have presented a list of 17,953 terms,
each of which specifies in some way a form of human behavior. 

However, no matter how vague and inadequate the words or
symbols may be, it is possible to categorize the fundamental needs 
of human beings into a brief but meaningful list. For our present 
purpose, which is to provide a basis for the understanding of the
emotional and social adjustment of pupils, we shall not be concerned 
with the so-called “unlearned” organic drives or goals such as air-getting,
temperature regulation, hunger, thirst, rest and sleep,
elimination, and so on. Rather, we are here concerned with the
forms which these drives take when they have been molded by 
interaction with the social-cultural milieu in which they must be 
satisfied. At this level of discussion, it should be clear that we are
going beyond the bounds of experimental information, and can be
guided only by the requirements of clarity, simplicity, and meaningfulness.
Therefore, let us now proceed to listing the desires, motives,
urges, intentions, purposes (or whatever else is the most meaningful
name for these aspects of pupils) in such terms as will be most useful
to classroom teachers.

The more important motives of pupils at all levels of education, 
as these motives are related to an understanding of the emotional 
and social adjustment of pupils in our culture, may be formulated 
as follows: 

1. The desire for social approval: Favorable attention, sympathy, 
companionship, conformity to the mores, customs, and fashions
of one's social group are all basic needs of pupils. Social approval
is one of the most powerful forces by which personality and be- 
havior are determined. 

2. The desire for mastery: The urges to excel, to succeed, to over- 
come obstructions, to defeat a rival, to achieve a goal, to solve 
a problem, to dominate a situation are all manifestations of this 
type of motive. Success and mastery along some line of endeavor 
are essential to the emotional well-being of everyone. 

3. The desire for new experience: Exploratory patterns, curiosity, 
inventiveness, concern with the fresh, the strange, and the un- 
familiar, all seem to be a basic need of human beings. A fixed 
routine in time or space can be followed for only a relatively 
small segment of one’s lifetime before this urge toward novelty 
becomes irrepressible. 

4. The desire for security: The feeling of being wanted, of being 
assured that one’s presence and contribution are welcome, the 
need for stable affection from family and personal relationships, 
all constitute another important category of human motivation.
The origin of this desire, in the physiological needs and the love 
responses of the infant, is strongly related to but not identical 
with the derivation of the need for social approval. 

5. The desire for individuality: The need to assume adult respon- 
sibility, to take up obligations and become independent of the 
family's material and emotional support, to attain adult individ- 
uality or self-integration, is a motive derived largely from the 
needs of society. The continuously recurring truth that today's 
children must soon run the world has caused this desire for 
independence and responsibility to become an integral part of 
human make-up. 




Essential to the proper understanding and use of these five cate- 
gories of motivation is an appreciation of their interaction and inter- 
dependence. In all but the most elementary situations an individual's 
behavior is determined by combinations of motives; frequently all 
of the strong motives of the individual are determiners of a single 
action. Our categories also make no claim to be definitive or complete.
Like all such lists this one is merely a convenient descriptive
device, to be interpreted neither as explanation nor as cause. 

Now, what is the relationship of this discussion of motivation to 
the classroom teacher's understanding of what aspects of emotional 
and social adjustment of pupils should be evaluated? The answer to
this question may be given in terms of an analysis of the process of 
adjustment. Such an analysis should show the relationship between 
motivation and adjustment so that the nature of emotional and
social maladjustment may be better understood.

The adjustment process may be analyzed into four principal steps:

1. The operation of a motive.

2. The presence of some obstacle to the immediate satisfaction of
the motive. Such obstacles or thwarting factors may be divided*
into three general classes: environmental obstacles, such as walls
around a prison, or an oversolicitous mother, or the mores and
customs of society, or the activities of other persons; personal
defects, such as lameness, ugliness, mental defects, social defects
such as lack of position or education, emotional instability;
conflict between antagonistic motives, which, unlike physical forces,
do not cancel each other but rather result in increased tension,
vacillation, and non-specific activity.

3. Varied responses, or trial and error reactions, using either motor
or symbolic processes, and guided in varying degree by past
experience. It is at this stage that emotional and social
maladjustments usually arise. If the varied responses are not immediately
successful in bringing about a satisfaction of the desire, an
emotional response may arise leading to a persistent non-adjustive
reaction, that is, an excessive persistence in an unadaptive mode
of activity. Failure to adjust continues with thwarting and the
lack of satisfaction of the desire, which in turn produces another
attempt to overcome the obstacles, another thwarting, and another
heightening of the emotional response: a vicious circle.
Here we have the key to the understanding of emotional and
social adjustment or maladjustment.

* Environmental and personal obstacles interact with one another. Thus the size
of an environmental obstacle which will successfully thwart a motive is inversely
proportional to the degree of personal defect of an individual in the area in which
the obstacle operates.

4. Solution, satisfaction, adjustment of the problem, desire, or motive.
This is the goal of the adjustment process, the point at which


opens the way to objective evaluation and guidance, should be en- 
couraged if the above discussion is to have the desired result. The 
adjustment or maladjustment which results from the satisfaction or 
thwarting, respectively, of the basic needs of pupils may now be 
seen in its proper relationship to those needs.


THE MANIFESTATIONS OF MALADJUSTMENT


We are now ready to examine in some detail the manifestations 
of emotional adjustment and social adjustment. In what ways do 




ignore or admire (and therefore encourage) some of those forms 
of behavior which reflect poor mental health in their pupils. This 
has been shown by several studies on teachers’ attitudes toward pupil 
behavior (1, 9, 10). 

Teachers regard as most serious the behavior problems of children
which attack the teacher's moral sensitivities, personal integrity,
authority, or immediate teaching purposes. In contrast, the quiet,
compliant, submissive, obedient child whose behavior is agreeable
to teachers and who respects their authority is not considered to be
a maladjusted child. Yet it happens that the amount of trouble a
child gives a teacher is not a valid measure of his mental health. It
is the consensus of clinical psychologists that the withdrawing
modes of behavior are more serious and dangerous than the aggressive
types because they are more likely to escape detection, because
they are more difficult to overcome in treatment, and because
they more frequently lead to serious mental disorder. This latter
clause is based on evidence from a large number of careful studies 
concerning the earlier childhood characteristics of persons who later 
became mentally ill. The results obtained by Bowman (3) confirmed 
in general the findings reported by other observers (5) and justified 
the conclusion that persons who later developed functional mental 
illnesses exhibit distinctive personality traits long before there is 
any evidence of a definite mental disease. The most striking of these 


teachers, relative inflexibility in adjusting to any change in the
established routine, extreme sensitivity which increases with age.
Perhaps most striking, these children were not “behavior problems” in
the usual sense of that term but rather were usually “model” 


Apart from this general statement that the more serious social and 
emotional adjustment problems are usually to be found among the
quiet, well-behaved "model" children rather than among the noisy,
troublesome “problem” children, we have not yet come to grips
with the problem of what the teacher should evaluate in this area.
Let us now, therefore, discuss more specifically the symptoms with 
which a teacher should be concerned in evaluating a pupil's emo- 
tional and social adjustment. Myers (6) has provided a useful clas- 
sification of the specific types of mentally unhealthy pupils most 
frequently encountered in school: 

1. The "unsociable" child 

2. The “model” child 

3. The “defensive” child 

4. The "nervous" child 

5. The "emotional" child 

The “unsociable” child is one who is always wandering off by 
himself, prefers to play alone, shows lack of interest in joining 
other children in their activities, is bookish, and usually likes noth- 
ing better than to stay in and do little chores for the teacher. Such a 
child is developing a habit of social withdrawal which does not 
provide a good foundation for later mental health. He may come to 
feel that he can have more fun when he is off by himself than 
when there are others around to interfere with his wishes, when 
he is keeping to himself, close-mouthed, uncommunicative, and 
secretive. Inattention or, more accurately, attending to his own
ideas, hopes, fantasies or daydreams, rather than to what the teacher 
holds before him, is a simple way of accomplishing this end. Such 
daydreaming may be of the free type or it may be systematic, per- 
sistently following a central theme. Similarly, shyness or self-con- 
sciousness indicates not modesty but rather an undue preoccupation 
with oneself and a potentially dangerous lack of self-confidence. 

The “model” child displays characteristics which are commonly
regarded as virtues but he carries them to such an extreme that
they become symptomatic of poor mental health. Such virtues as
neatness, conscientiousness, courtesy, honesty, ambition, caution, or
thrift become excessive and are undesirable when they constitute a
means of evading difficulties. The child who carries any of them to 
excess is often one who, having inadequacies in more important 
things, such as a real lack of any other of these virtues, finds that 
he can obtain the approval he craves by simply being a paragon in 
some other respect. 




The “defensive” child is one who rationalizes, that is, gives
elaborate and logical-sounding reasons to explain an act which was
really performed for purely emotional reasons. This useful method
of self-justification and fact-dodging, and its allied mechanisms,
alibis and bragging, become harmful and symptomatic of mental ill
health when they are carried to excess.

The "nervous" child is timid, fearful, anxious, shy, awkward,
[...] ill at ease, according to one understanding of this vague
and misused term. Or this term may mean irritability, tenseness,
and overactivity. It is not an explanation and does not refer to
anything wrong with the nervous system. The "nervous" child does
feel insecure and uncertain, and does believe he is different. Tics,
grimacing, twitching, jerking, biting the nails, pulling the [...],
shrugging, nodding, blinking, twisting, picking, scratching,
and so on, are manifestations of such a condition. These mannerisms
are distinctly of less importance. More important is the neurotic
[...] of exaggerating minor pains and illnesses in order to secure
sympathy and attention.

The “emotional” child is one who has acquired (our system of
emotional responses, as we have it in childhood and adulthood,
is almost wholly acquired) unstable, uncontrolled habits of emotional
expression. Inability or unwillingness to repress emotion is
incompatible with group living. But mere repression is not enough,
for emotion, whether repressed or not, arouses widespread organic
disturbances of the vegetative functions, the digestive activities,
[...] secretions, circulation, blood chemistry, and so on, with
important effects on physical and mental health. Not merely emotional
[...] but less occurrence of emotion itself is desirable. This
means that teachers should be extremely sparing in their use of
[...] an incentive. Less emotionalizing will help lessen the fear
[...] child who has developed so many fears that he is almost always
afraid, has no confidence in himself or others, and shows great
[...] about trying anything new or strange. He has disturbances
[...], night-terrors, inability to go to sleep, restlessness, lack
of appetite, food fussiness, and a general appearance of malnutrition.

The aspects of the emotional and social adjustment with which
teachers should be concerned may be given more specificity than
is contained in the above discussion by the list of behavior problems
in children presented in Table 3 from Wickman’s study. This
list not only presents the fifty behavior problems most frequently 
reported by teachers, but ranks them according to the seriousness 
attached to them by consensus of thirty clinical psychologists. The 
reader should note the correspondence between this rank order* 
and the earlier discussion in this chapter. 


FURTHER ASPECTS OF PERSONALITY


In addition to the descriptions of the various forms of personality 
problems among pupils and the ranked list of behavior problems 
frequently encountered by teachers, our present answer to the ques- 
tion, what do we mean by the emotional and social adjustment of 
pupils? should be oriented to the formulation of the field reflected 
in existing tests and rating devices. Such a formulation may be ob- 
tained from the chart prepared by Traxler (8 : 59-62) which lists the 
traits for which three or more personality tests or rating devices have 
been constructed. The traits or aspects of personality for which Traxler
lists the most tests or rating devices are those subsumed under
“emotional adjustment” and “social adjustment,” which indicates
that these are the most pervasive aspects of personality. However,
certain other aspects of personality are mentioned which are included
neither in other chapters of this book (as are social attitudes,
vocational interests, and health and physical adjustment) nor
specifically within the terms “emotional and social adjustment.” These
aspects should be familiar to teachers because they provide additional
points of view from which to evaluate their pupils. Following is a
list of them: ascendance-submission, introversion-extroversion,
neurotic tendency, self-control; self-reliance, assurance, confidence; home
adjustment, school adjustment, honesty, responsibility-dependability,
initiative, cooperativeness, and masculinity-femininity.

* We have computed a rank order correlation coefficient of .85 between the
ranking in Wickman's study and that given by forty-two other child psychologists on
twenty-three of these behavior problems. (These latter rankings are reported by
C. Thompson, “The attitudes of various groups toward behavior problems of
children,” Journal of Abnormal and Social Psychology, 35 : 120-125, 1940.) This
correlation coefficient shows that there has been little change during a twelve-year
period in the attitudes of clinical psychologists toward these behaviors as symptoms
of personality development.
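
For readers unfamiliar with the statistic named in this footnote, the following is a minimal sketch of how a rank order correlation coefficient may be computed. It assumes the familiar Spearman formula, rho = 1 - 6(sum of d squared) / (n(n squared - 1)), which holds when no ranks are tied; the text does not say which formula the authors used. The function name `spearman_rho` and the five-item rankings are hypothetical illustrations, not data from Wickman or Thompson.

```python
def spearman_rho(ranks_a, ranks_b):
    """Spearman rank order correlation for two untied rankings.

    Assumes each argument lists the ranks 1..n assigned to the same n
    items, in the same item order, with no tied ranks.
    """
    if len(ranks_a) != len(ranks_b):
        raise ValueError("both rankings must cover the same items")
    n = len(ranks_a)
    if n < 2:
        raise ValueError("at least two items are needed")
    # Sum of squared differences between the paired ranks.
    sum_d_squared = sum((a - b) ** 2 for a, b in zip(ranks_a, ranks_b))
    return 1 - (6 * sum_d_squared) / (n * (n * n - 1))


# Hypothetical example: two groups of judges rank the same five problems.
group_one = [1, 2, 3, 4, 5]
group_two = [2, 1, 3, 5, 4]
rho = spearman_rho(group_one, group_two)
print(round(rho, 2))
```

A coefficient near 1 means the two groups ordered the problems in nearly the same way, which is the sense in which the footnote reads .85 as showing little change in attitude between the two sets of judges.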


TABLE 3. Ratings by Mental Hygienists on the Relative Seriousness of
Behavior Problems in Children (Ratings of 30 Clinicians) (9 : 127)

[Chart not recoverable. Columns: Type of Problem; Average Score; Rated
Seriousness of Problem, on a rating scale marked Of Only Slight
Importance (4.5), Of Considerable Importance (12.5), and Of Extremely
Great Importance (20.5). Legible problem labels: Unsocialness;
Suspiciousness; Unhappy, depressed; Resentfulness; Fearfulness;
Cruelty, bullying; Easily discouraged.]


Ascendance-submission refers to a person’s tendency to dominate 
or be dominated by his fellows in various face-to-face relationships of 
everyday life. This aspect of personality reflects most directly the 
“desire for mastery” of our list of fundamental motives. Like other 
traits with twofold names, ascendance-submission is continuously 
distributed from extreme ascendance to extreme submissiveness, most 
people coming between the two extremes. Furthermore, the question 
whether it is possible to generalize about individuals with respect to 
their ascendance-submissiveness may be answered affirmatively in 
the light of the self-consistency of tests designed to measure this trait; 
that is, a tendency to dominate or submit is not specific for each of 
the myriad social situations in which a person finds himself, but 
rather pervades a person’s responses to many situations. Thus the 
individual who is ascendant at a party will be ascendant in dealing 
with clerks, in meeting strangers, in assuming leadership, and so 
forth. 

Introversion-extroversion refers to whether an individual turns in- 
ward or outward in his attempts at adjustment. This is one of the 
earliest and most popularly known classifications of personality. The 
extrovert is the person who is oriented to objective facts habitually; 
his entire consciousness looks out upon the world “because the im- 
portant and decisive determination always comes to him from with- 
out" (4). The extrovert's interests and attention follow objective 
habits, especially those of the immediate environment. On the other 
hand, "introverted consciousness . . . selects the subjective determi- 
nant as the decisive one." Jung divided each of his two major types, 
extroverts and introverts, among four additional “function-types” 
based on his analysis of the chief varieties of human expression.
These are stated as thinking, feeling, sensation, and intuition; when 
these are combined with the two major types we have eight subtypes. 
Here again we may doubt whether people can be divided into two 
distinctive classes or whether they mostly lie between these two ex- 
tremes, as ambiverts. Introversion, furthermore, should not be con- 
fused with seclusiveness; that is, we should not lump together a 
liking for thought and absorption and a like or dislike of people, 
since the two are not closely related. Since the other aspects of 
Traxler’s list are expressed in ordinary, readily understood terminol- 
ogy, they need not be defined here. 



As was stated before, these personality traits, conceived under the 
heading of what aspects of pupils should be evaluated, are useful 
mainly as points of view, or springboards, for the teacher in consider- 
ing pupils. While it is easy to define what is desirable in the field of 
behavior problems and general adjustment, emotionally and socially, 
it is more difficult to say what points along the continua of ascend- 
ance-submission and extroversion-introversion are most desirable. 
These concepts are not yet related to specific kinds of educational 
and vocational fitness, as are certain types of mental abilities. But in 
our culture the notion has arisen that somehow introversion, on the
whole, is less normal than extroversion, less effective, and less
desirable.

It must be realized, of course, that the value of any point of view 
concerning personality depends largely upon the availability of meth- 
ods for making evaluations from that point of view. For example, 
such concepts as extrovert and introvert are of less value until we 
have tests, rating scales, or other techniques that enable us to place 
an individual with some accuracy at a point along the continuum 
of extroversion-introversion. This question will be taken up in 
Chapter XVI. Furthermore, the discussion of aspects of personality 
and adjustment in this chapter should be supplemented by additional 
readings in the field of personality, child psychology, and mental 
hygiene. 


SUMMARY 


The nature of adjustment is considered in terms of a list of funda- 
mental motives and of the general process whereby individuals 
achieve the satisfaction of these motives. The manifestations of 
maladjustment frequently not recognized as serious by teachers are 
described in general, in the form of descriptions of types of mentally 
unhealthy pupils, and in terms of Wickman’s ranked list of behavior
problems. Such further aspects of personality as ascendance-submission,
introversion-extroversion, and others are briefly discussed
as additional points of view for the consideration of pupil adjustment.
The importance of further study of mental hygiene by teachers and
of the availability of devices for evaluating aspects of personality is
emphasized.




QUESTIONS 


1. From your experience cite one or more examples of emotional or
social maladjustment. Analyze the case in terms of the motive
unsatisfied, the nature of the thwarting obstacle, and the reasons for
the inadequacy of the attempts to overcome the obstacle. If possible,
suggest a desirable solution or treatment.

2. Note the types of behavior problems at the top and bottom of
Wickman’s ranked list. Why did the clinical psychologists assign
these ranks to these problems?

3. What differences in point of view lead to disagreement between the
importance assigned to various behavior problems by teachers and by
clinical psychologists?

4. Contrast the "mental hygiene" point of view with the moralistic
approach to behavior problems.

5. To what extent is it a teacher's responsibility to be certain that every
pupil experiences success and mastery in at least one line of endeavor?

6. How may the desire for new experience lead on the one hand to
antisocial behavior and to highly socially approved behavior on the
other?


REFERENCES 


1. Ackerson, Luton, “On evaluating the relative importance or seriousness
of various behavior problems in children,” Journal of Juvenile
Research, 20 : 114-123 (1936).

2. Allport, G. W., and Odbert, H. S., Trait-Names; A Psycho-lexical
Study; a Study from the Harvard Psychological Laboratory, Princeton:
Psychological Review Company, 1936.

3. Bowman, K. S., “Study of the pre-psychotic personality in certain
psychoses,” American Journal of Orthopsychiatry, 4 : 473-498 (1934).

4. Jung, C. G., Psychological Types, New York: Harcourt, Brace &
Company, Inc., 1923.

5. Kasanin, J. N., and Veo, L., “A study of the school adjustment of
children who later in life became psychotics,” American Journal of
Orthopsychiatry, 2 : 406-409 (1931).

6. Myers, C. R., Toward Mental Health in School, Toronto:
University of Toronto Press, 1939.

7. Shaffer, L. F., The Psychology of Adjustment, Boston: Houghton
Mifflin Company, 1936.

8. Traxler, A. E., The Use of Tests and Rating Devices in the Appraisal
of Personality, New York: Educational Records Bureau, Educational
Records Bulletin No. 23, 1938.

9. Wickman, E. K., Children’s Behavior and Teacher's Attitude, New
York: Commonwealth Fund, 1928.

10. Yourman, J., “Children identified by their teachers as problems,”
Journal of Educational Sociology, 5 : 334-343 (1932).


CHAPTER VI 


Attitudes 


THE IMPORTANCE OF ATTITUDES


IT WAS NOTED IN CHAPTER II THAT ATTITUDES ARE AN IMPORTANT PART 
of many statements of instructional objectives. Educational philoso- 
phers, curriculum builders, administrators, and teachers are all be- 
coming increasingly aware of certain inevitable “non-intellectual” 
effects of the educative process (14). That is, educational procedures 
and curriculum content can and do change attitudes. And even 
where attitudes are not subject to the influence, conscious or
unconscious, of the school but rather are shaped by out-of-school
experiences, the school cannot escape the need for being concerned with
these aspects of pupils. Thus Briggs (5 : 401) considers attitudes and 
interests so important that he devotes more than one-third of his 
Secondary Education to these factors: “The emotionalized attitudes 
function constantly—for the intelligentsia in demanding and inter- 
preting knowledge, for them and for all the rest of mankind in vary- 
ing degrees leading more or less immediately to action. The very 
triumphs of civilization in extending its bounds have increased the 
inherent importance of recognizing, modifying, and directing the 
emotionalized attitudes.” 

This point of view means that education can no longer proceed as 
if human beings were intellectual machines activated by pure reason.
Both from the standpoint of society and from the standpoint of the
individual, the importance of attitudes needs only to be mentioned
in order to be appreciated. (For the present let us define attitudes 
roughly as “feelings for or against something.”) Socially, the system 
of morals which governs any given group or society can be truly 
said to be a matrix of attitude patterns which constitutes the
"flywheel of society," to use the phrase of William James. To the extent
that these attitude patterns function in the lives of individuals in
society without creating undue stresses and strains, they constitute
characteristics of a society which is stable with respect to its aims
and purposes. Education must therefore be fundamentally concerned
with whether it is producing types of attitude patterns that are
desirable as the integrating forces in society.

Does this imply a system of indoctrination rather than of education?
These two words are themselves charged with feeling. Educators
have upheld the slogan that children must be taught how to
think but not what to think; that is, children must be educated, not
indoctrinated. However, this slogan contains a fundamental
psychological self-contradiction. Thinking does not occur in the abstract or
in a vacuum but proceeds in terms of the individual's background of
attitude patterns and experience. Consequently, in the process of
teaching how to think we shall inevitably and necessarily also to a
considerable extent be teaching what to think. Psychologically,
indoctrination and education cannot be separated.

From the standpoint of the individual, attitudes derive their
importance in terms of mental hygiene. The individual's own evaluation
of his conduct and desires in relation to the system of social
[...] as he understands them constitutes the basis for social-emotional
adjustment leading either to a happy, effective, socialized
individual or, at the other extreme, to a complete disintegration of the
personality. The individual's attitudes toward his associates,
playmates, pupils, teachers, institutions, customs, all have a basic effect
on his mental ease or mental dis-ease. His attitudes also affect what
he perceives, what he remembers, and, in fact, what he thinks. Such
mental mechanisms as rationalization, projection, and “sour grapes”
are now familiar household words in elementary textbooks in
psychology. “Our intellect,” said G. Stanley Hall, “is a mere speck
afloat on a sea of feeling.” The attitudinal make-up of the pupil is
this sea of feeling. The integration of the intellectual-emotional life
[...] is as much a matter of attitude as is social integration, or
the problem of holding society together.

More specifically, attitudes are a vital concern of guidance, and
consequently of educational evaluation, because they affect:


1. The pupil's fitness for various curricula. Unless a pupil has a fa- 
vorable attitude toward a set of instructional objectives and sets 
them up as desirable goals for himself, the educative process will 
not be maximally effective. 

2. The pupil's fitness for various occupational goals. Bingham (4: 
82) has summarized the reason for concern with attitudes as re- 
lated to occupational fitness in terms of answers to the questions 
whether (a) the individual will like the actual work of an occu- 
pation; (b) the individual will find himself among congenial as- 
sociates, with interests similar to his own; (c) symptoms of the 
individual’s future abilities may be uncovered; (d) alternative 
fields of occupation which may not yet have been seriously con- 
sidered may be discovered. 

3. The pupil's fitness for eventual effective and desirable participa- 
tion in a democratic social order. Attitudes toward social groups, 
institutions, practices, and policies, such as, respectively, Negroes, 
freedom of speech, initiative and referendum, or an unbalanced 
budget, are all attitudes in which society and the schools have a
real stake. Most important of all, perhaps, are the pupil’s attitudes 
toward the acceptance of social responsibility. Any guidance con- 
cerned with the basic essentials of individual and social progress 
requires the evaluation of the pupil's respect for his future right 
to vote, his amenability to social change, and his sensitivity to 
social problems. Hence evaluation of pupils’ attitudes assumes a 
role of fundamental importance in guidance. 

The above remarks were intended to demonstrate why teachers
should be concerned with attitudes as they are determined both by 
out-of-school and by in-school experiences. Although this argument 
cannot have full meaning until we understand what attitudes are, 
it does express the purpose of this chapter. Our present thesis is 
simply that guidance requires evaluation of pupils’ attitudes. 

Our treatment of what attitudes are will proceed in terms similar 
to those used in the treatment of mental ability. First, we shall present
general descriptions and definitions of attitudes and the relationships
of attitudes to other concepts allied with them in psychological
terminology. Secondly, we shall discuss the structure or
organization of attitudes in the personality. Thirdly, we shall discuss
the determiners of attitudes, the ways in which they develop, and
their relationships to other aspects of individuals. Fourthly, we shall
present an enumeration and description of particular, specific attitudes
which are of importance to guidance.


DEFINITION OF ATTITUDES


As was stated above, attitudes may be roughly defined as feelings
for or against something. This definition serves to provide a framework
upon which may be hung more exact terminology so as to
yield a more rigorous definition. The term “feeling” points to the
[...] independence of attitudes from detailed, rational, intellectual,
cognitive, mental operations. Rather, attitudes are linked to the
emotions; pleasant and unpleasant associations (fear, rage, love, and
the variations and complications in these emotions brought about
by learning) play a part in attitudes. The phrase "for or against"
denotes the directionality of attitudes, the fact that they are
characterized by approach or withdrawal, likes or dislikes, avoidant or
[...] tendencies, favorable or unfavorable reactions, loves or hates,
which are responses to specific or generalized stimuli. The word
“something” signifies that attitudes are not merely mental images or
[...] ideas, but rather take on meaning only when they are
considered in relation to some object, situation, or stimulus. A
fourth characteristic of attitudes is that they have an effect on
behavior which may be so great that the attitude enables the prediction
of behavior, or which may be influenced in such a way by other
forces, social and attitudinal, that the behavior will not follow the
expressed attitude, as when a pupil who expressed opposition to
cheating proceeds to cheat on an examination. A fifth characteristic
of attitudes, to be treated later in more detail, is that they are
acquired or learned.

In summary, then, an attitude may be defined as a more or less
emotionalized tendency, organized through experience, to react
positively or negatively toward (for or against) a psychological object.

Attitudes and Certain Allied Concepts. Certain concepts more or
less closely allied to attitudes, some of them essentially synonyms
or near-synonyms, may be considered here, since they are frequently
[...] not only in popular discussion but in technical literature as
well. These concepts are interests, motives, instincts, appreciations,
taste, mores, morality, morale, ideals, social distance, and character.
Other similar concepts could be listed, but these will serve the
purpose of the present discussion, which will seek to show that each of
these concepts, from a dynamic, functional point of view, is
constituted of attitudes.

Interests as observed are presumably the reflection of attractions 
and aversions in our behavior, of our feelings of pleasantness and un- 
pleasantness, likes and dislikes. In terms of action they are charac- 
terized by seeking-acceptance at one end of the scale and by avoid- 
ance-rejection at the other (8 : 15). A distinction may be made be- 
tween attitudes and interests in that the latter merely indicate the 
degree to which the individual prefers to hold an object before his 
consciousness whether he reacts approvingly or disapprovingly 
toward that object, while attitudes indicate his reaction in terms of 
its direction, pleasantness or unpleasantness, agreement or disagree- 
ment. But since we do not prefer to hold an object before our con- 
sciousness unless we agree with it or find it pleasurable, the distinc- 
tion is a very fine one, and attitudes and interests are for practical 
purposes identical. If this is the case, why do we use the term atti- 
tude instead of interest? The answer is that theoretical and experi- 
mental social psychology has produced a vast literature around this 
concept and has used the word "attitude" to denote it. However, in 
vocational psychology and guidance the term "interest" has been 
mainly used. Thus Bingham (4 : 62) defined interest as a "tendency
to become absorbed in an experience and to continue it." The simi- 
larity of this definition to the definition of attitudes is apparent. The 
use of a separate term for such a tendency serves only to complicate 
our thinking and to hinder the application of experimental and the- 
oretical work in attitudes to the field of "interest." For this reason 
we shall here use the term attitude as inclusive of such mental func- 
tions as vocational psychologists and guidance workers have called 
"interests." 

Motives are related to attitudes in that the latter, with their direc- 
tionality and feeling-tone, may constitute an important aspect of 
motives. Thus a highly favorable attitude toward a particular teacher 
may motivate a pupil toward emulation of that teacher. Hero wor- 
ship is an extreme but illustrative form of attitude operating as mo- 
tive. Here again the two concepts are so closely related that any dis- 
tinction between them is difficult to make; however, attitudes refer 




more specifically to the goal of a motive, toward the attitude object, 
while the motive refers to the force, derived in part from the atti- 
tude, which is used in overcoming obstacles to the individual's 
achievement of a satisfactory relationship with the attitude object. 

Instincts are nowadays in disrepute, at least as applied to human 
behavior. In Chapter V we indicated the confusing multiplicity of 
terms arising from the concept of instincts. Since attitudes to an even 
less extent than other psychological aspects of individuals may be 
considered to be inborn and unaffected by experience, the distinction 
between attitudes and inborn instincts seems especially sharp. How- 
ever, in so far as attitudes and instincts have both been required to 
bear a large part of the burden of theories of motivation it is highly 
probable that the same aspects of behavior are frequently denoted by 
the two terms.

Appreciations are considered in the plural rather than the singular 
in order to retain the perspective of the operational point of view, 
whereby concepts are defined in terms of the operations performed 
in observing or knowing them. The term "appreciation" is loosely 
used in at least two widely different senses. “I appreciate your point 
of view" obviously uses the term with the connotation of understand- 
ing or comprehending—a cognitive, intellective meaning. Apprecia- 
tion of literature, music, graphic and plastic art—in short, aesthetic 
appreciation—on the other hand, connotes emotionalized, affective 
processes. It is in this latter sense that the term concerns us. Its kernel 
here is the acceptance-rejection notion, and this is readily subsumed 
under the concept of attitudes. 

Taste as a term is used practically synonymously with appreciation. 
Although it, like the term appreciation, carries a cognitive connota- 
tion, it implies also the acceptance-rejection, likes and dislikes mean- 
ing, and therefore operationally falls under the heading of attitudes. 

Mores, morality and morale are three terms conveniently treated 
together. In both their linguistic and their social-psychological mean- 
ings they are closely associated. Mores are defined by Sumner (16 : 
59) as "the ways of doing things which are current in a society to 
satisfy human needs and desires, together with the faiths, notions, 
codes, and standards of well-living which inhere in those ways, hav- 
ing a genetic connection with them.” The mores “are social ritual in 
which we all participate unconsciously” (16 : 62); they define “right”
and “wrong” for a particular social group. Though usually having a
rational origin, many of the mores become obsolescent through 
changing social conditions but tend nevertheless to continue as cate- 
gorical imperatives, and tend as such not to be questioned. They are 
subjectively highly emotionalized and violations of them are 
“wrong,” “sinful,” “obscene,” “in bad taste,” and, in general, not tol- 
erated by the group. Their emotional loading brings them readily 
within the purview of the concept of attitudes, at least from the op- 
erational, measurement point of view. 

The term morality in its popular meaning has come in this country 
to be restricted largely to sex morality but is properly to be used as 
referring to all social sanctions implemented by the mores. The moral 
act, then, is one which is in harmony with the mores either in doing 
what they demand or in refraining from doing what is contrary to 
them. The moral person is one who observes the mores. The con- 
nection of morality from the point of view of social psychology with 
the notion of attitudes was indicated at the beginning of this chapter. 
Morale refers to the esprit de corps, the emotional integration of a 
group. Its essence psychologically is the integrating pattern of atti- 
tudes which exist with reference to attitude objects judged to be of 
vital concern to the group, these being frequently of a threatening 
nature. Or these attitude objects imply goals the achievement of 
which is endangered by the absence of morale. Thus we have the 
morale of an army or of the civilian population in war, the morale 
of the classroom, of an industrial organization, of a teaching staff, 
and so forth. 

Ideals are the conscious aspects of the mores (5 : 472-473). Con- 
scious striving toward "ways of doing things" most acceptably is the 
essence of ideals. They are the individual's conscious adjustment to 
the demands of society, the public, as conceived and understood by 
him. The public may be a "phantom" public and need have no 
counterpart in reality. Consider, for example, the attempted control 
of a small boy's behavior by means of Santa Claus. 

Social distance, a sociological concept, refers to the differences be- 
tween the mores or the ideals of individuals and groups and implies 
tensions or clashes inherent in different sets of mores and ideals. 
Social distance exists between management and labor, rich and poor, 
the younger generation and the older, one national or racial group
and another. Strikes, lynchings, race riots, and even wars are the re-
flections in large part of the social distance between groups of human 
beings. In so far as such distance persists we shall have failed to 
achieve the “brotherhood of man." The reduction of social distance 
between groups and the evaluation of the individual man on his own 
merits are major parts of the democratic ideal. And in studying and 
working with social distances we are again concerned in the main 
with attitudes. 

Character has ethical connotations sufficiently unambiguous to re- 
quire little analysis here. The person of character is the moral person 
as defined in terms of attitudes. He behaves in accordance with in- 
dividual ideals that are socially approved. 

This brief and sketchy exploration of the meanings of related terms 
is, perhaps, sufficient to show the central nature of the concept of 
attitudes. Further elaboration of the concept in all its varied aspects 
is contained in Dewey's classic Human Nature and Conduct (7) and 
in the writings of the Allports (1, 2), who attempt to relate the con-
cept to the structure and functions of the nervous system, as "neuro-
muscular sets." 


THE ORGANIZATION OF ATTITUDES


In the field of attitudes, as in the field of mental ability, questions
of organization concern mainly the degree of specificity or
generality, or the size of the organizational units to
be observed in an individual’s personality or behavior. That is, are 
attitudes organized into large structures or small ones, into “cobble- 
stones” or “grains of sand”? For example, are we justified in calling 
a person “liberal” or “conservative” and, from this general label, in- 
ferring his attitude towards a large number of more specific attitude 
objects such as races, nationalities, internationalism, labor, income 
taxes, and so on? Similarly, are we justified in labeling a person 
“honest” or “dishonest” and thereby inferring whether he will exhibit 
honest or dishonest behavior in a wide variety of situations such as 
taking an examination, returning excess change to a grocer, volun- 
tarily confessing a rule violation in an athletic competition, and so 
forth? 

These questions concerning the generality or specificity of attitud-
inal organization have not been answered in the same way by vari-
ous researches and experiments which have been concerned with
them. That is, some researches have drawn conclusions in favor of 
specificity and some in favor of generality. Perhaps the most influ- 
ential of the researches supporting the doctrine of specificity have 
been the studies by Hartshorne and May (9) of the traits and atti- 
tudes of honesty, service, and self-control. Broadly stated, their con- 
clusions are that we are unjustified in considering these traits and 
their related attitudes as general characteristics of children but rather 
that they are a function of the specific situation in which a child is 
placed and that an individual behaves similarly in different situations 
only in so far as these situations are alike. For example, although it 
is possible to state that the child who cheated in one classroom sit- 
uation would cheat in another, the same child might be scrupulously 
honest in athletics, party games, etc. Stealing money was unrelated to 
stealing answers on examinations. Consequently the traits and atti- 
tudes centering around the concept of honesty were considered to be 
highly specific. 

Other studies, however, have drawn very different conclusions con- 
cerning the organization and interrelationships of attitudes. For ex- 
ample, Cantril (6) examined the responses of a sample of college stu- 
dents to a series of terms, statements, personality sketches, and the 
Allport-Vernon Study of Personal Values. He found evidence for
generality of some sort in mental life which is independent of specific
content. Similarly, Herrick (10) employed a group of mental tests,
rating devices, autobiographical sketches, and interviews and also 
drew conclusions in favor of the existence of general attitudes of a 
college student group concerning certain social issues and matters of 
conduct. 

Further evidence in favor of some generalization of attitudes 
within individuals is found in the significantly greater than zero 
correlations which have been found between various attitudes, such 
as those toward pacifism, communism, and the church. Radicalism- 
conservatism has been found to be a general attitude enabling pre- 
diction of more specific attitudes toward races, national ideals, im- 
perialism, militarism, international good will, birth control, religion, 
etc. Even in the field of character traits and attitudes Herrick, in the 
study cited above, was able by the method of factorial analysis to 
show that the data of Hartshorne and May contained certain “group
factors” or clusters of character traits and attitudes, a finding in
sharp disagreement with the extreme specificity inferred by those 
authors. 
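
The correlations referred to above are ordinarily product-moment coefficients. As a minimal sketch, assuming invented scores for eight pupils on two hypothetical attitude scales (the names scale_a, scale_b, and pearson_r are illustrative only and are taken from none of the studies cited), such a coefficient might be computed as follows:

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equal-length lists with at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products of deviations from the two means.
    cross = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    # Sums of squared deviations for each scale.
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cross / sqrt(ss_x * ss_y)


# Invented scores of eight pupils on two hypothetical attitude scales.
scale_a = [2, 4, 5, 7, 8, 3, 6, 9]
scale_b = [3, 4, 6, 6, 9, 2, 7, 8]

r = pearson_r(scale_a, scale_b)
print(f"correlation between the two scales: {r:.2f}")
```

A coefficient near zero for such a pair of scales would favor the doctrine of specificity; a coefficient well above zero would suggest some degree of generality. With so few pupils, of course, no coefficient could be taken as significant without a test of its reliability.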

The differences in the conclusions concerning generality and speci- 
ficity may in part at least be due to the differences in the ages of the 
groups studied. That is, generality of mental organization may be- 
come greater as age increases; the integration and self-consistency of 
attitudes would thus be a function of the amount of time during 
which an individual has been under the influence of the socially or- 
ganized sets of attitude patterns by which his own consistency was 
judged. Consequently, teachers should adjust their expectations of 
consistency in a pupil’s attitudes to his age and maturity; high school 
pupils will thus be less consistent than college students but more 
consistent, integrated, and predictable from situation to situation and 
from one attitude to another, than are elementary school pupils. 

Another factor determining the degree of generality found is the 
narrowness of the attitude considered. The more narrowly we define 
an attitude, the more closely it is related to other similarly narrowly 
defined attitudes and the higher the degree of generality which we 
will infer. For example, if we consider attitudes toward the pro- 
hibition of wines and beer as one attitude, and attitudes toward the 
prohibition of whiskey and brandy as another distinct attitude, we 
shall find these two attitudes highly correlated and conclude that we 
have evidence for a general attitude toward the prohibition of alco- 
holic beverages. But this illustrates merely the procedure used in 
constructing instruments to measure attitudes, whereby a group of 
closely knit, internally consistent statements of attitudes are assem- 
bled so as to provide a statistically reliable test. (The meaning of sta- 
tistical reliability will be discussed in Chapter X.) If this is the case, 
what should be our criterion or basis for distinguishing between 
single, unitary attitude entities and attitudes which should be con- 
sidered not as units or entities in themselves but merely as compo- 
nents or elements of attitudes? Where should we stop in our break- 
down of attitudes? There is, of course, no absolute rule in this mat- 
ter. Attitudes, or the attitude object of reference by which they are 
most conveniently denoted, should be defined or considered as en- 
tities separate from other attitude objects in terms of the practical
reasons for our concern with attitudes.


If a teacher has reason to be concerned with a pupil's attitudes 
toward work in general, then that may be considered an attitude. 
A more specific problem will lead the teacher to be concerned with 
attitudes toward a specific kind of work, such as home work or 
classroom tasks. A still more specific problem will lead the teacher 
to be concerned with home work in a specific subject, such as algebra 
or history. These considerations may be summarized by a quotation 
from Newcomb (12 : 1029):

“The needed caution is simply that no attitudes as measured are
genuine entities in the sense that there is anything ‘absolute’ about 
them. For practical purposes any set of verbal responses which is 
statistically reliable may be considered an entity and given an ap- 
propriate name. Any measured attitude, no matter how reliable, 
might conceivably be broken down into two or more different atti- 
tudes with slightly different labels. A single label implies nothing 
concerning singleness of attitudes. Indeed, the concept of singleness 
has meaning only in regard to the object of reference. If the attitude 
measured has reference to a custom, person, or institution commonly 
accepted as an isolable phenomenon and is reliably measured, it may 
be regarded as an entity.” 

In summary, then, the generality or specificity of attitudes may be 
considered to be a function of (1) the degree to which attitude ob- 
jects or attitudes themselves are organized into sets of related clus-
ters by the society in which an individual lives; (2) the degree to
which an individual has absorbed the structure or organization of 
the society in which he lives, which in turn is the function of his 
age, maturity, and sensitivity to social forces; and (3) the narrowness 
with which an attitude is defined, broader, more inclusive attitudes
being more likely to be independent, self-contained, and “specific”
than are attitudes more narrowly conceived.


THE DETERMINERS OF ATTITUDES


Why does a given individual have a certain attitude? Where shall 
we look for the causes or origins of a person’s attitudes? Why do 
people differ in attitudes? That is, why are some liberal and some 
conservative, religious or atheistic, favorable or unfavorable toward 
a given teacher, a given subject, vocation, or any other attitude ob- 
ject? One answer to these questions has been provided by G. W. All- 
port (2 : 810-811). He pointed out four ways in which attitudes are
developed; they may be labeled (1) integration, (2) differentiation,
(3) shock, and (4) adoption. Integration is the development of an 
attitude through accumulation of a large number of experiences over 
a long period of time, all of which influence the individual in a 
given direction. Thus, long-continued failure in solving arithmetic 
problems will be integrated by a pupil into an unfavorable attitude 
toward arithmetic. Development of an attitude by differentiation 
may be described as the splitting off of a specific attitude from a 
more general one, as when an individual has an unfavorable atti- 
tude toward arithmetic as a result of his unfavorable attitude toward 
all school subjects. Attitude development by shock is due to an un- 
usual, violent, or painful experience; a child's attitude toward den- 
tists may thus be quickly and forcefully molded by the experience 
of having a tooth pulled. Finally, an attitude may be developed by
adoption, in that the individual merely follows the example of 
friends, teachers, parents, newspapers, and other opinion-molding 
agencies; the daughter who is a Republican merely because her 
father is a Republican illustrates this way of developing an attitude.

Another point of view in regard to the determiners of attitudes is 
exhibited in the classification of related variables by Newcomb (12 :
912-1046). He deals with the relationships between attitude and
(1) individual characteristics, (2) experimental variables, (3) life 
experiences, and (4) other attitudes. 

1. Under the heading of individual characteristics are included 
sex, age, intelligence, and such non-intellectual characteristics as 
muscle coordination, suggestibility, persistence, susceptibility to ma- 
jority influence, ability to break long-established habits, speed of re- 
action time, tendency to sacrifice accuracy to speed, ability to think 
in unusual terms, neurotic tendency, ascendance-submission, and 
other personality variables. 

2. Under the heading of experimental modification, Newcomb 
deals with investigations in which attitudes have been measured 
before and after the introduction of some experience presumed to 
modify them, these experiences being in turn classified into those 
occurring within the classroom and those occurring as extra-school 
experiences. The school experiences are chiefly those due to particular
curricular materials, especially in the social sciences, and those due 
to specific teachers, usually either liberal or conservative with re- 
spect to broad issues. The extra-school experiences include motion
pictures, radio programs, pamphlets, speeches, and other forms of
“propaganda.” 

The characteristics of methods which appear to be most effective in 
the experimental modification of attitudes are summarized as fol- 
lows: (a) They should be vivid, novel, emotionally charged, and 
realistic (as opposed to laboratory-like). (b) There should be neither 
strong opposing influences nor opportunity to become familiar with 
the complexities of or the objections to the point of view being ad- 
vanced. (c) The methods should use individuals, groups, institutions, 
or symbols thereof, which have prestige value for those whose atti- 
tudes are to be affected. 

3. Under the heading of life experiences as determiners of attitude 
are included the more “normal,” everyday kind of experiences. Typ- 
ical of these are such factors as educational level reached or amount 
of education received, college military training in the R.O.T.C., the
experience of college fraternity life, familiarity and contact with 
races and nationalities, continued experiences of association with a 
particular family, allegiance to particular religious groups, racial or 
national background, socio-economic status, residence in rural or 
urban environment or in different national or cultural areas. 

The influences of these “natural” experiences are summarized as 
follows: (a) Few generalizations are justifiable concerning experi- 
ences which are more or less individual rather than being associated 
with a particular group or a particular locality. (b) Experiences 
which are regularly associated with family, race, or church groups 
enable better prediction of attitude. (c) Experiences shared by larger 
communities geographically defined enable practically no prediction 
of attitude, except for urban-rural differences in racial prejudice 
and superstitious belief and for sectional differences within this 
country in regard to Negroes and Orientals. 

4. Under the heading of the interrelationship of attitudes, New- 
comb deals with the question of generality and specificity which 
has already been discussed in this chapter. 

Our discussion of the determiners of attitudes may be concluded 
with a reference to the two most broadly conceived classifications of 
such determiners: personality differences on the one hand, and con- 
formity-enforcing agencies or social institutions on the other. The 
first of these will tend to produce variability within groups, and the
second produces differences between groups. In considering a pupil’s
attitudes for the purpose of guiding him educationally, vocationally, 
or personally, the teacher should look to these two kinds of factors 
for an explanation of his attitudes. The teacher's understanding of 
the nature of attitudes, their organization, and their determiners will 
enable him to make better use of the concept of attitudes in evaluat- 
ing the pupil. Let us now turn to the consideration of what attitudes 
the teacher should evaluate. 


ATTITUDES SIGNIFICANT FOR GUIDANCE


As already stated, the most convenient way to denote an attitude 
is by its attitude object, that is, by the thing toward which the atti- 
tude is held. The reason for this is that all other properties of an 
attitude—its directionality, its feeling-tone, its motivating power— 
are not sufficiently different from attitude to attitude to enable them 
to be used as definite labels for a particular attitude. Consequently, 
in choosing from among all possible attitudes those which are most 
significant for guidance, we shall be concerned mainly with attitude 
objects. 

Any attempt to classify or bring some sort of systematic order into 
the field of attitude objects, if it were to lay claim to any other than 
mere practical usefulness, if it were to make pretensions toward sci-
entific validity or unique truthfulness, would have to be based on
some metaphysical system whereby the contents of the universe were 
ordered. Obviously, no such attempt can be made here. Rather we 
shall seek to select and arrange attitude objects solely for the purpose 
of clarifying and simplifying the evaluation of pupils for guidance 
purposes. The sort of effort we shall make has been exemplified by 
the work of Horne (11), who attempted by the questionnaire and 
rating scale techniques to determine what a representative popula- 
tion regarded as socially significant attitude objects and to rank these 
objects in order of social significance. From his studies there emerged 
233 attitude objects, which were classified by two psychologists into 
eight categories: 


1. Personality
2. Education
3. Economic activities
4. Family
5. Government
6. Social problems
7. Recreation and exercise
8. Religion


All of the 233 attitude objects fitted into one of the categories, the 
two judges agreeing almost completely as to the classification of the 
objects. Another classification of attitudes such as we shall here at- 
tempt is apparent in a paper by Nelson (13) which summarizes the 
experimental literature on social attitudes under four headings: At- 
titude toward personal ideals, toward political issues, racial attitudes, 
and religious attitudes.

The classification made by Horne will be used as a framework
upon which to formulate our presentation of attitude objects sig- 
nificant in guidance. We shall present his eight headings with illus- 
trations under each drawn either from his study or from other 
sources.

Personality.—Personality as an attitude object classification refers 
to such broad, all-embracing concepts as “personal values” and “level 
of aspiration." According to Spranger (15), an individual's personal- 
ity is best understood in terms of his values, his evaluative attitudes 
toward the common attributes of a number of classes of situations. 
Values may really be considered generalized attitudes. Spranger dis- 
tinguished six types of values: theoretical, economic, aesthetic, social, 
political, and religious. The theoretical type of man values truth. The 
economic man values wealth. The aesthetic man values beauty. The 
social man values people for their own sake. The political man val- 
ues power. The religious man values a “mystical unity with the 
cosmos." The significance of Spranger's theory has been increased by 
the construction of a test (3) to measure the relative strength of these 
six values in an individual, so that at certain age levels (above about 
fifteen years) it is possible to acquire valuable data concerning a 
person's attitudes along these all-embracing lines. 

The level of aspiration concerns the goal which an individual sets 
for himself in the achievement or solution of any task or problem. 
It is his attitude toward the degree to which he should achieve. 
Knowledge concerning the pupil's level of aspiration and its rela-
tionship to his abilities constitutes perhaps the most important
base for predicting an individual's achievement along various
lines.

Obviously other personality attitudes have been distinguished and 
may prove valuable. The two we have pointed out are perhaps the 
best known and most thoroughly explored. They should serve as il-
lustrations and bases for further thinking on the question of attitudes
which color the entire personality. 

Educational Attitude Objects.—School subjects, teachers, and teach- 
ing practices, such as home work, classroom drill, examinations, and 
study periods, are perhaps the most obvious attitude objects within 
the field of school work. Certain administrative aspects of the schools, 
such as the size of classes, the grading system, the system of records 
and reports, the length of the school day, vacations and holidays, are 
also objects toward which pupil attitudes may be significant for 
guidance. 

Economic Activities.—In the economic area attitudes toward spe- 
cific vocations, either as measured directly or as inferred from the 
similarities between an individual’s interests and the interests of 
those engaged in a particular occupation, are obviously of high im- 
portance for guidance. 

Family.—Attitudes toward parents, brothers and sisters, the home
and community environment, and toward the family as an institu- 
tion may throw considerable light on questions arising in the field 
of personal guidance and, consequently, in all guidance. A more ex- 
tensive treatment of this fundamentally important aspect of pupils 
is given in Chapters VII and XVIII. 

Government and Social Problems.—In the broad field of govern- 
ment and social problems we can merely suggest some of the most 
important of the innumerable attitude objects. General liberalism- 
conservatism has been the most explored aspect of this field. Social
sensitivity, or awareness and feeling of responsibility for social prob-
lems, is a similarly inclusive aspect. Racial attitudes, international at- 
titudes, attitudes toward government ownership, social insurance, 
civil liberties, conservation, immigration, and other specific social in- 
stitutions and practices may become foci in the measurement both of 
the outcomes of social science instruction and of the pupil’s fitness 
for his proper role in a democratic society. 

Recreation and Exercise.—What an individual likes to do for fun,
his preferences and interests in the fields of recreation and exercise,
has been considered important in the field of educational evaluation
not so much for its own sake but for the light these attitudes throw
on vocational interests and personal and social adjustment. That is, 
questions concerning recreational interests have constituted a large
part of tests and inventories in these other fields. Their use in this
way, of course, does not preclude their more narrow function as 
indices of recreational organization which schools and communities 
should provide or consider in their plan to meet the needs of youth. 

Religion.—The complexity of this field of attitude objects, the 
wide variation between communities in the nature of their concern 
with religious attitudes, the traditional American separation between 
church and state agencies, all operate to render this area too difficult 
to be treated here. It is included merely for the sake of complete- 
ness. Teachers and school administrators will be able to distinguish 
the more meaningful objects within this attitude area in the light 
of their own particular community situations and needs. 


SUMMARY 


The importance of attitudes as educational outcomes to the 
integration of society and of individuals is emphasized. A definition 
of attitudes is given and the relationships of the concept to certain 
allied terms are described. The organization of attitudes in terms 
of generality and specificity is considered and the conditions under 
which various degrees of generality will be found are discussed. 
The determiners of attitudes are discussed in terms of four different 
ways in which they may be developed and of the various related 
variables which have been studied in the experimental literature. 
Attitudes significant for guidance are classified into eight categories; 
illustrations of typical attitude objects in each category are given. 


QUESTIONS 


1. Make an attempt to distinguish between education and indoctrina- 
tion. Is it possible to apply the basis of your distinction so as to 
classify any teaching activity as education or indoctrination? 

2. Give illustrations of the failure of attitudes to achieve the integration 
of a social group and of an individual person. 

3. What are practical implications of a high degree of generality or 
specificity, respectively, in pupils’ attitudes toward any category 
of attitude objects, such as vocations, teachers, or contemporary 
social institutions? 

4. Examine your own attitudes for illustrations of each of the four 
general ways in which attitudes may be developed. 

5. Give illustrations of ways in which attitudes in each of Horne's
   eight categories may affect a pupil's fitness for various curricula, for
   various occupations, and for good citizenship.

6. What social attitudes are responsible for the sensitivity of educators
   concerning the problem of teaching attitudes in the schools?


REFERENCES 
1. Allport, F. H., Social Psychology, Boston: Houghton Mifflin Com-
   pany, 1924.

2. Allport, G. W., “Attitudes” in Murchison, Carl (ed.), Handbook
   of Social Psychology, Worcester: Clark University Press, 1935.

3. Allport, G. W., and Vernon, P. E., A Study of Values, Boston:
   Houghton Mifflin Company, 1931.

4. Bingham, W. V., Aptitudes and Aptitude Testing, New York:
   Harper & Brothers, 1937.

5. Briggs, Thomas H., Secondary Education, New York: The Mac-
   millan Company, 1933.

6. Cantril, H., “General and specific attitudes,” Psychological Mono-
   graphs, Vol. 42, No. 192, 1932.

7. Dewey, John, Human Nature and Conduct, New York: Henry
   Holt & Company, Inc., 1922, p. 41.

8. Fryer, Douglas, The Measurement of Interests, New York: Henry
   Holt & Company, Inc., 1931.

9. Hartshorne, H., and May, M., Studies in the Nature of Character,
   New York: The Macmillan Company, Vols. I-III, 1928-1930.

10. Herrick, V. E., “The generality and specificity of attitudes,” Ph.D.
    Thesis, University of Wisconsin Library, unpublished; referred to
    in Young, Kimball, Personality and Problems of Adjustment, New
    York: F. S. Crofts & Co., 1940, p. 288.

11. Horne, E. Porter, “Socially significant attitude objects,” Studies in
    Higher Education XXXI, Bulletin of Purdue University, 37 : 117-
    126 (1936).

12. Murphy, G., Murphy, L. B., and Newcomb, T. M., Experimental
    Social Psychology, New York: Harper & Brothers, 1937.

13. Nelson, Erland, “Attitudes: III. Social attitudes,” Journal of Gen-
    eral Psychology, 21 : 417-436 (1939).

14. Remmers, H. H., “Attitudes as educational objectives,” University
    of Washington, College of Education Record, 7 : 68-75 (1941).

15. Spranger, E., Types of Men (transl. by P. J. W. Pigors), Halle:
    Niemeyer, 1928.

16. Sumner, W. G., Folkways, Boston: Ginn and Company, 1906.


CHAPTER VII

Environment and Background


Thus far in our discussion of aspects of pupils which should be
evaluated, we have been concerned only with results, effects, out- 
comes, or present status of the pupil. Of course, when the evalua- 
tion of each of these aspects is carried on over a period of time, it
becomes possible to conceive of yesterday's "present status" as a
background, cause, or determiner of today's "present status." For 
example, our evaluation of the pupil’s freshman achievement in 
mathematics furnishes data for our prediction and understanding of 
his senior achievement in mathematics; in this case, freshman 
achievement is the environment or background of senior achieve- 
ment. But even when we have evaluated achievement, physical 
aspects, mental abilities, adjustments, and attitudes over a consider- 
able period of time and also at the present, these groups of aspects 
do not include the whole constellation of aspects of pupils denoted 
by the phrase "environment and background." 

It is the purpose of the present chapter to furnish a somewhat 
detailed notion of what is meant by “environment and background,” 
to organize this field into a group of meaningful categories, and 
to discuss the significance of these various categories as determiners 
of the other aspects of pupils and as essentials for the evaluation 
of pupils for guidance purposes. 

It is important in using data concerning a pupil's environment 
and background to realize that none of these data taken alone 
permits hard and fast prediction of a student's standing with regard 
to any of the other aspects. This is so because of the large over- 
lapping which occurs in any aspect of pupils when they are grouped 
according to a single item of environment or background. For example,
although there is a positive and undoubtedly significant
correlation between father's occupation and pupil's intelligence, the 
overlapping among occupational groups makes it impossible to 
predict an individual student's intelligence or achievement from 
knowledge of his father's occupation. A similar condition will be 
found to prevail with respect to the other items of background 
and environment. However, an acquaintance with the experimental 
findings concerning some of these relationships will prevent the
operation of various prejudices and unfounded beliefs such as those 
concerning the temperament of "only children." Furthermore, such 
relationships as are found to be significant do not always hold the 
same significance for guidance practice. Some relationships exist 
because of cause and effect. Other significant correlations between 
items concerning pupil background and, say, pupil adjustment may 
merely reflect their mutual dependence upon a common third
factor. For example, is the relationship between pupil’s achievement
in school and father's occupation one of cause and effect, or is it
due to some underlying factor on which occupation of the father 
is itself dependent? These considerations point to the need for 
caution in drawing inferences from any single fact about a pupil's 
background or environment. 
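
The two cautions of the preceding paragraph can be made concrete with a short calculation. The sketch below is illustrative only: the group means (100 and 110), the standard deviation (15), and the three correlations (.30, .60, .50) are hypothetical figures chosen for the example, not findings reported in this chapter, and the scores are assumed to be normally distributed. The first two functions show how much two groups overlap even when their averages differ; the third applies the standard first-order partial correlation formula to show how a relationship between two variables may disappear when a common third factor is held constant.

```python
from math import sqrt
from statistics import NormalDist


def group_overlap(mean_a, mean_b, sd):
    """Overlapping coefficient (0 to 1) of two normal score distributions
    that share the same standard deviation."""
    return NormalDist(mean_a, sd).overlap(NormalDist(mean_b, sd))


def share_above(mean_low, mean_high, sd):
    """Fraction of the lower-scoring group that exceeds the mean of the
    higher-scoring group."""
    return 1.0 - NormalDist(mean_low, sd).cdf(mean_high)


def partial_correlation(r_xy, r_xz, r_yz):
    """Correlation of x and y with the third factor z held constant."""
    return (r_xy - r_xz * r_yz) / sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical figures: two occupational groups whose mean scores differ by
# 10 points on a scale with a standard deviation of 15.
overlap = group_overlap(100, 110, 15)
above = share_above(100, 110, 15)

# Hypothetical correlations: x = pupil achievement, y = father's occupation,
# z = an assumed underlying factor related to both.
residual = partial_correlation(0.30, 0.60, 0.50)

print(f"overlap of the two groups: {overlap:.2f}")
print(f"lower group above higher group's mean: {above:.2f}")
print(f"x-y correlation with z held constant: {residual:.2f}")
```

With these assumed figures roughly three-quarters of the two distributions coincide, about one pupil in four from the lower group exceeds the mean of the higher group, and the correlation between x and y vanishes once the third factor is held constant.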

This chapter will merely furnish clues and suggestions concerning 
variables in which it is probable that data useful for understanding
and prediction will be found. None of these areas taken alone will
infallibly serve these purposes. Nor will the gross accumulation of
every possible fact and figure in all of these areas provide the
significant datum which will enable a teacher to understand, predict,
and guide a pupil's achievement, intelligence, attitudes, or adjustments.
Only by the exercise of shrewd common sense combined
with scientific circumspection and psychological insight will the
teacher know where to look, in all the vast area of pupil environment 
and background, for the data significant in the solution of a par- 
ticular guidance problem. 

The total area of environment and background may be divided 
into the home, the community, and the school. The interdependence 
of these three divisions is self-evident, since knowledge concerning 
any one almost always enables us to form fairly safe conclusions 
concerning some aspects of the others. Thus, to illustrate with an
extreme, the knowledge that the father of a family presides over a
great corporation enables safe prediction concerning the general 
nature of the community in which his children live and of the 
school which they attend. This interdependence does not, however, 
lessen the convenience and clarity afforded by a consideration of 
environment and background under these separate headings. Let 
us now turn to a consideration of each of these divisions in turn— 
their general nature, their subdivisions, and the significance of each 
total and its parts for the various aspects of pupils.


HOME BACKGROUND AND ENVIRONMENT


The description of the pupil's home and family may be facilitated
by considering it from four points of view:

1. Parent-to-parent relationships

2. Parent-to-child relationships

3. Child-to-child relationships

4. Socio-economic status

From the standpoint of parent-to-parent relationships, families
may be classified on an accord-discord scale concerning which Cook
(9:130) points out that accord homes are integrated and coopera- 
tive while discord homes are the opposite. A third category of 
homes is the broken home: families where an adult member has 
been removed by death, divorce, desertion, or other cause, such as 
employment for the mother outside the home or for the father in a 
distant community. The connection between parent-to-parent rela- 
tionships and personality development has been implicitly recognized 
by all students of the subject. Where the two parents deeply love 
each other, in the fullest sense of that relationship, and manifest their 
love in a healthy, socially desirable manner, a home atmosphere 
is created which must be recognized as conducive to mentally 
healthy children. Conversely, unhappy marital relationships are 
conducive to maladjusted children. For example, Baruch (1) found
significant “coexistence” between child adjustment and such inter- 
parental relationships as tension over sex, ascendance-submission, 
lack of consideration, lack of cooperation on the upbringing of the 
child, extra-marital relations, tension over health, inability to talk 
over differences to mutually acceptable solutions, tension over insufficient
expression of affection, tension over friends, work, and
relatives. 

Similarly, Mowrer (19) has classified marital conflict situations
into the following types: (1) culture conflicts, sometimes resulting 
when two individuals of different cultural background marry; 
(2) response conflicts, due to certain habits of affection, leaning 
upon and being attached to other members of the family, expecting 
partiality and special consideration; (3) dominance conflicts, due 
to differences in the dominance or submissiveness of the roles which 
each of the partners expected to assume in the marital relationship; 
(4) sex conflicts, due either to the other types of conflict or to dif- 
ferences between man and wife in their expectations as to the 
nature of sex relations. This classification may provide help in 
understanding a particular family situation and its relationships to 
pupil maladjustment. 

The relation of unsettled or broken home conditions to the 
scholastic success of high school pupils was studied by Curtis and 
Nemzek (10). They divided broken homes into six classes: loss of
father by death, loss of father by divorce or separation, unemploy- 
ment of father, loss of mother by death, loss of mother by divorce or 
separation, or employment of mother outside the home. For each 
of these six classifications 50 pupils were located and paired with 
pupils from normal homes on the bases of intelligence, chronological 
age, grade in school, sex, and nationality. For each of the 600 pupils a 
measure of academic success was computed, an honor point average
based upon teachers’ marks. In seven comparisons the data indicated 
that the school achievement of pupils from broken homes was in- 
ferior to that of pupils from normal homes. Risen (22) corroborated 
these conclusions concerning the effect of broken homes by his find- 
ing that lack of one or both parents affected the child’s intelligence 
quotient unfavorably, increased the amount of over-ageness, increased 
the number of failures in school subjects, and increased the child’s 
chances of becoming a problem case for the school counselor. These 
findings, although far from perfectly conclusive, indicate the ad- 
visability of looking into relationships between a pupil’s parents for 
clues to the pupil’s behavior. 
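
The logic of the matched-pair design described above may be sketched as follows. The pupil counts follow the text (six classifications of fifty pairs each); the honor point averages in `sample_pairs` and the helper names are hypothetical, supplied only to show how paired differences would be summarized, and they are not data from Curtis and Nemzek.

```python
from statistics import mean

# Design figures taken from the text: six classes of broken homes,
# fifty pupils in each, every one paired with a pupil from a normal home.
CLASSES = 6
PAIRS_PER_CLASS = 50
total_pupils = CLASSES * PAIRS_PER_CLASS * 2


def summarize_pairs(pairs):
    """Summarize (broken_home_average, normal_home_average) pairs.

    Because each pair is matched on intelligence, age, grade, sex, and
    nationality, the difference within a pair is taken as the quantity
    of interest.
    """
    diffs = [broken - normal for broken, normal in pairs]
    return {
        "pairs": len(diffs),
        "mean_difference": mean(diffs),
        "pairs_lower": sum(d < 0 for d in diffs),
    }


# Hypothetical honor point averages for four matched pairs.
sample_pairs = [(1.8, 2.1), (2.4, 2.4), (1.5, 2.0), (2.9, 2.6)]
summary = summarize_pairs(sample_pairs)

print("pupils in the design:", total_pupils)
print("summary of hypothetical pairs:", summary)
```

A negative mean difference in such a summary would indicate lower achievement among the broken-home members of the pairs; whether a difference of a given size is dependable would still have to be judged against its sampling error.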

The relationships between parents and children have been classified
by Stagner (26 : 299) into those centering about love or affection,
and those centering about discipline, or the fear relationship.
It is these relationships which are paramount in the molding of the 
child's personality. If the home is the most important part of the 
child's life, then the parent-child relationships within that home are 
its most crucial aspect for personality development. The White 
House Conference on Child Health and Protection (30 : 299-300)
stated its conclusion concerning the significance of the home for
the child's personality development as follows: “Of paramount influence
are the subtle, intangible relations of family life such as
affection, confiding in parents, trust and loyalty of child to parents
(as measured by a statement of no criticism), and control by other
means than punishment. In importance for child development they
seem to outweigh by far the role of the external aspects of the
home, upon which emphasis has in the past been placed, such as 
economic status of the family, number of rooms to a person, and 
formal education of the children.” 

Concerning the love or affection relationship between parents and 
child it is convenient to conceive of a scale ranging from overpro- 
tection to rejection. Overprotection arises when the parents secure 
an excessive emotional satisfaction from satisfying the demands of 
the child without imposing any restrictions or limitations upon 
these satisfactions. It is an excessive performance by the parent of 
his function of protecting the child. It may result in stunting the 
child’s progress toward self-reliance and independence, both by 
withholding opportunity for practice and by fostering an overat- 
tachment of the child to the parent. Rejection is the opposite of 
overprotection in that it means denial of affection and care to the 
child. Thus the child is cast upon his own resources for emotional 
and social security and affection. Either or both parents may reject 
him, with consequences both for the parents’ relationships with 
each other and for the other parent's relationship with the child, 
as when rejection by one parent results in overprotection by the 
other. Such rejection in the form of criticism, punishment, or lack 
of affection, or disguised in the form of overprotection, has been 
found to have definite effects on child personality. Symonds (27) 
concluded that the rejected child is likely to show aggressive traits 
and be antagonistic toward others. Similarly, Burgum (6) and 
Witmer and her colleagues (31) have found rejected children to be 


Environment and Background 107 


in need of clinical treatment. In conclusion, the reader is referred 
to the survey by Symonds of the entire field of parent-child relation- 
ships (28). 

Child-to-child relationships are usually considered in terms of the 
relationship between birth order, or ordinal position in the family, 
and other aspects of the child. For example, what are the effects on
a child's intelligence, personality, health, or adjustment of being 
the only child in a family, or the last-born child of twelve chil- 
dren, or a boy with five sisters, and other such situations? Perhaps 
the most complete classification of ordinal positions has been made 
by Krout (16), who was able to distinguish twenty-six ordinal posi- 
tions, thirteen for each sex. To groups of subjects representing each 
of these twenty-six positions, Krout administered a schedule from 
which he determined (1) by which parent a child was most favored 
or disciplined, (2) which parent dominated in the family, (3) to 
which siblings the subject was most attached, submissive, or domi- 
nant, and (4) the attitudes of dominance, submissiveness, and at- 
tachment of each subject toward males and females outside his 
family. While the results are too elaborate to be presented here, 
it is sufficient for our present purpose to record that significant 
relationships were found between ordinal positions and behavior 
patterns within the family. 

This, however, is perhaps the only recent study which has claimed 
positive findings concerning the psychological importance of birth 
order. Krout cites eight recent studies of such variables as intel- 
ligence, vocabulary and sentence development, juvenile thefts, 
neuroticism, personality and incidence of stuttering, all of which 
have produced essentially negative data. And from the literature 
of the past seventy years on birth order, he concludes that virtually 
every study establishing some point with regard to birth order has 
been refuted by a later study proving the possibility of an opposite 
point in the same connection. However, the study of Thurstone 
and Jenkins (29), controlling the factor of socio-economic status 
by testing only siblings, concluded that “intelligence increases, on 
the average, with order of birth in the same family.” This conclusion 
can be interpreted, of course, only in the light of the relatively small, 
even though statistically significant, differences obtaining between 
the various birth-order groups. 




Similarly, the personality characteristics of the only child, in either 
the kindergarten or the university, have only infrequently been 
found to be different in any practically important degree from those 
of children who have siblings. Remmers (21) found that “only” 
children were significantly more frequent among distinguished 
(honor) students than were students who had brothers or sisters. 
But despite the smallness of the relationships and the consequent 
impossibility of forecasting with certainty from knowledge con- 
cerning ordinal position in the family, this factor still remains a 
suggestive and potentially enlightening source of understanding 
concerning the guidance problems of any particular pupil. 

A more fruitful point of view concerning child-to-child relation- 
ships and their effects on other aspects of pupils considers them in 
terms of attitudes and social relationships. Hero worship or shame
for a brother or sister, jealousy, the functioning of an older child 
as a parent substitute for one or more of the younger children may 
serve to explain much in the personality of a particular pupil. In 
the latter case, imitation, suggestion, and identification would exist 
as frequently in the relation between younger and older children 
as they would in the relation of child to parent. 

The fourth consideration in evaluating a pupil’s social and eco- 
nomic background and environment is socio-economic status, which 
has been defined by Chapin (7) as “the position that an individual 
or a family occupies with reference to the prevailing average stand- 
ards of cultural possessions, effective income, material possessions, 
and participation in group activities of the community.” Another 
definition of socio-economic status may be derived from the four 
sections into which Kerr and Remmers (14) organized their 
American Home Scale: (1) the aesthetic level, the presence of cer- 
tain objective elements in the home which should tend to promote 
attitudes of aesthetic appreciation in children; (2) the cultural level; 
(3) the economic level; and (4) miscellaneous. This conception of 
socio-economic status is representative of the recent advances in 
this field over previous formulations of the concept in terms of a 
single environmental variable, such as occupational status, income, 
effective buying power, possession of a telephone, or other single 
factors. However, if only one item relating to socio-economic status 
can be taken into consideration, the occupation of the father is probably
the most significant. It is usually related to income and bears
strong implications concerning the family's degree of economic 
security, advantages in travel, possession of books and magazines, 
and other cultural, recreational, and vocational opportunities. 

The relationship between socio-economic status and other aspects 
of pupils has been shown to be definitely significant in the case of 
some aspects, and either non-significant or beclouded by the inter- 
action of other variables in the case of others. Let us now survey 
representative studies of these relationships. The achievement of 
instructional objectives and the proportion of pupils in elementary 
and secondary schools who succeed in graduating are both positively 
related in a significant degree to socio-economic status. Thus 
Kornhauser (15) reported relationships between retardation in 
school and an index of socio-economic status (the possession of a 
telephone). Similarly, Holley (13) found a significant relationship 
between the number of years a child remains in school and the 
schooling of the parents, the rental value of the home, the real 
estate assessment, and the number of books in the home. More 
recently, Collins and Douglass (8) have found relationships between 
junior high school success and socio-economic status. Concerning 
the physical aspects of the child, the White House Conference on 
Child Health and Protection concluded that the lower the socio- 
economic status, the poorer the parents’ health and the poorer the 
health of the children. The intelligence and mental ability of pupils 
has consistently been found to increase as socio-economic status 
increased, although there is always large overlapping between 
groups. Since this question has already been briefly discussed in 
Chapter IV, it will not be further treated here. The evidence con- 
cerning emotional and social adjustment in its relation to socio-eco- 
nomic status is contradictory and confused, some investigators record- 
ing relationships in one direction and some in another direction, 
still others finding no relationship at all. For example, the White 
House Conference obtained the following very low correlation 
coefficients:

1. Socio-economic status and teachers’ ratings on desirable
character traits, .17 ± .03

2. Socio-economic status and teachers’ ratings on social
aggressiveness, .16 ± .03

3. Socio-economic status and teachers’ ratings on emotional
stability, .04 ± .03

4. Socio-economic status and well-balanced personality as
shown by personality tests (Thurstone’s Neurotic Inventory),
.0013 ± .03

It is thus concluded that there is a very slight tendency for high 
socio-economic status to be associated with social aggressiveness 
and desirable character traits as reflected in teachers’ ratings of 
pupils (5 : 112). However, Springer (24, 25) found that children
from middle-class homes showed greater emotional stability than 
children from a lower social status. Here again the true situation is 
probably that low socio-economic status can cause emotional and 
social maladjustment when it is associated with and accompanied by 
parental maladjustment and other factors which thwart the normal 
urges of the individual. It is safe to conclude that the socio-economic 
status of children constitutes a fruitful area in which to search for 
clues that will explain a pupil’s adjustment. 
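
The four White House Conference coefficients listed above may be examined more closely if each is read as a correlation followed by its probable error. The sketch below rests on two assumptions not stated in the text: that the second figure in each pair is a probable error, and that a coefficient is to be taken seriously only when it is at least four times its probable error, a conventional rule of thumb. The function `probable_error_of_r` gives the usual formula for the probable error of a correlation; the sample size of 475 passed to it is hypothetical, since the number of pupils is not reported here.

```python
from math import sqrt

# (label, correlation, probable error), as listed in the text.
REPORTED = [
    ("desirable character traits", 0.17, 0.03),
    ("social aggressiveness", 0.16, 0.03),
    ("emotional stability", 0.04, 0.03),
    ("well-balanced personality", 0.0013, 0.03),
]

# Assumed rule of thumb: r should be at least four times its probable error.
MIN_RATIO = 4.0


def ratio_to_probable_error(r, pe):
    """How many probable errors the coefficient lies from zero."""
    return abs(r) / pe


def probable_error_of_r(r, n):
    """Probable error of a correlation r based on n cases."""
    return 0.6745 * (1 - r ** 2) / sqrt(n)


dependable = []
for label, r, pe in REPORTED:
    ratio = ratio_to_probable_error(r, pe)
    dependable.append(ratio >= MIN_RATIO)
    print(f"{label}: r = {r}, ratio to probable error = {ratio:.2f}")

# Hypothetical sample size, used only to show the formula at work.
example_pe = probable_error_of_r(0.17, 475)
print(f"probable error of r = .17 with 475 cases: {example_pe:.3f}")
```

On this reading the first two coefficients exceed four times their probable errors and the last two do not, which agrees with the cautious conclusion drawn in the text.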

The attitudes of pupils are also related to the home environment 
in the light of the substantial correlations which have been found 
between parents’ attitudes and those of their children. Typical of the 
studies which have indicated this is Peterson’s (20). He found that 
children’s attitudes are much like their parents’ toward such objects 
as the New Deal, attending movies, and the Negro. 

Other studies, however, have shown that attitudes toward vocations,
or vocational interest, may bear no relationship¹ to home
status. That is, children from low-income, low-opportunity homes
are prone to set their occupational ambitions at the professional
level about as frequently as are children from upper middle-class
homes who obviously will have a far greater chance of realizing
their ambition. Where such vocational interests determine curricular
choices and result in a lack of fit between training and opportunity,
serious waste of educational resources and of personal effort
must inevitably result. That the teacher should be concerned with
home background as an indicator of opportunity and of the realism
of the vocational plans and interests of pupils is obvious from these
considerations.

¹ These data concerning lack of correlation between vocational aspirations and
socio-economic status have been obtained by Remmers and Kerr in an unpublished
study of high school seniors in an industrial city in northern Indiana.

We may end our discussion of the home background and environ- 
ment of pupils with the following summarizing statement: Parent- 
to-parent relationships, parent-to-child relationships, child-to-child 
relationships and socio-economic status all constitute significant and 
valuable information for the guidance of pupils, both as they deter- 
mine and affect various aspects of the pupil himself and as they 
form a part of the environment to which he must adjust. 


THE COMMUNITY ENVIRONMENT AND BACKGROUND


This division of the pupil’s total environment and background 
may be defined in terms of exclusion, that is, as those parts of the 
environment which are neither of the family nor of the school. 
Conversely, the community environment and background includes 
such social conditions as economic depression and unemployment, 
customs and traditions, leisure-time activities (such as playing and 
camping), motion pictures, and radio. Economic conditions can 
influence pupils both directly through their effect on attitudes and 
opportunities and indirectly through their effect on the home and 
school. Prolonged depression and unemployment may thus produce 
severe personality disorganization. Rundquist and Sletto (23) ad- 
ministered tests to several thousand high school students, college 
students, people on relief, and other adults. Comparison of the 
employed and unemployed revealed wide differences with respect 
to their attitudes toward the economic order and with respect to 
general adjustment, but the two groups were not differentiated by 
feelings of inferiority or unfavorable family attitudes. 

Thus economic conditions will affect pupils directly if they are 
about to enter the world of work, and indirectly through the ad- 
justment engendered in their parents by the economic situation. 
Unemployment produces crowded living conditions and inadequate 
places to play and study, and these in turn increase the difficulty of 
making adjustments at home and at school. Meltzer (18) concluded 
from a study of children’s attitudes as influenced by economic deprivation
that economic insecurity is associated with emotional insecurity,
but that above certain low limits economic security does
not imply emotional security since children from the middle class 
manifested better adjustment than children from the upper economic
levels. In the early 1930’s the effects of the depression on the
social attitudes of high school and college youth produced what has 
been termed “the revolt on the campus.” Buck (4) has shown how 
the depression lessened the disapproval of debt and “socialistic” 
plans by 2000 university students. Breemes, Remmers, and Morgan 
(2) similarly reported that a decided increase in the liberalism of 
groups of college sophomores between 1931 and 1934 had not been 
lost by comparable groups of students measured in 1937 and 1939.
There had thus been a shift in student opinion at the beginning of
the depression which had remained stable through the next six 
years of the nation's economic and political development. 

Social customs, traditions, or mores, as was stated in Chapter VI, 
constitute a conservative, change-resistant force in the community. 
The attitudes of the community at large toward social institutions 
and personal values exert a profound influence on the individual 
pupil. That these mores frequently lag behind changing actualities 
is evident from the social history of the past decade. Middletown, 
the typical American small city, has undergone profound social and 
economic changes, but the thinking of its citizens has remained 
substantially the same (17). Symbols and faiths, the beliefs in in- 
dividual initiative, rugged individualism, equality of opportunity, 
and the “chicken in every pot” have resisted the onslaught of the 
depression and the collectivism of the New Deal. Such disparities 
between mores and realities were found by the Lynds to be pro- 
ducing the expected difficulties in adjustment. That teachers will 
find in social mores an important source of understanding their
pupils is self-evident. 

In Chapter VI we intimated that differences between urban and 
rural areas, between North and South and other geographical divisions
can profoundly affect pupil attitudes. For example, Davis (11)
was able to compare the influence of such disparate mores as those 
obtaining in Soviet Russia and in our own country. In Soviet Russia, 
as the result of what the schools teach, the profit system, whereby a 
man makes a living by selling products for more than they cost 
him including his own wages, is viewed with as much charity as
we should show to racketeers who charge one price for doing a
task and then use force to impose upon some worker a subcontract 
giving the worker half the price for doing the task. Consequently, 
Davis found that Soviet school children actually respected the oc- 
cupational title “ditch-digger” far more than “banker.” Here is a 
striking case of the importance of community customs for individual 
attitudes. 

The effect of leisure-time activities on adjustment, their construc- 
tiveness or destructiveness, may similarly be so significant as to 
constitute data essential to guidance. Thus a pupil’s hobbies, the 
voluntary activities which he finds to be inherently interesting, and 
the role which he takes in these leisure-time activities when they 
involve other individuals may reveal the developmental level of his
personality. For a pupil may fail to grow from one developmental 
level of play into the next. The child who engages only in individual 
activities and fails to join groups, as in becoming a member of a 
“gang,” may be manifesting symptoms of serious maladjustment. 
The individual who fails to observe the customary sex differences 
in play activities or to take a normal interest in the opposite sex 
at a certain period in his development is similarly deserving of the 
teacher's concern. Of obvious importance to the pupil's total ad- 
justment would be his membership in a delinquent gang from 
which he acquired attitudes and habits rendering him incapable of 
assuming his normal role in society. 

The role of movies and radio in molding attitudes of both adults 
and children has been demonstrated by several researches (see 
Chapter VI). Bruel (3) reported on motion pictures which provide
experiences conducive to the development of neuroses and found 
that these experiences are not limited to early childhood. The sig- 
nificance of the radio may similarly be judged from the study by 
DeBoer (12). He found that about one-third of the children he 
studied lay in bed thinking over what they had heard on the radio, 
that 30 per cent had recently dreamed about some radio program, 
three-fourths of these dreams being of the nightmare type. The 
potency of movies and radio as forces for maladjustment carries with 
it the implication that these agencies could be equally potent as 
forces of education. In any case, they frequently constitute an im- 
portant area of the pupil's community background. 




THE SCHOOL ENVIRONMENT AND BACKGROUND


This part of the pupil’s environment can be evaluated only if 
teachers and school administrators are willing to evaluate themselves 
and their own work (see Chapters XVIII and XIX). The mental 
hygiene of the pupil’s teacher, the teaching practices resulting from 
it and from the teacher’s professional training and equipment, the 
curricular organization of the school, its administrational organization
(e.g., the platoon system), the school's physical facilities, its
provisions for individual differences, its marking system, guidance 
system, and all other aspects of its functioning can and should have 
a vital effect upon the pupil. The complex problem of evaluating 
the school will be taken up in Part II. Here it is our purpose only to 
round out the discussion of pupil background and environment by 
drawing attention to the role of the school conceived as a part of a 
pupil's environment. 


SUMMARY


Environment and background are distinguished from other aspects 
of pupils in the sense of their being determiners rather than dimen- 
sions of pupils. The nature of the relationships between various 
factors in environment and background and the various aspects of 
pupils considered in other chapters is discussed. The total environ- 
ment and background are divided into the home, the community, 
and the school for separate consideration. Parent-to-parent relation- 
ships, parent-to-child relationships, and child-to-child relationships 
are discussed in terms of their effects on pupils’ emotional and so- 
cial adjustment and attitudes. Socio-economic status is considered in 
relation to techniques of measurement and to the various other 
aspects of pupils. Economic conditions, social customs, and movies 
and recreational activities are all found to be significantly related to 
pupil attitudes and emotional adjustment. Teachers, teaching prac- 
tices, curricular organization, and other parts of the school are men- 
tioned as significant determiners of various aspects of pupil develop- 
ment. 


QUESTIONS 


1. In what way are experimentally observed relationships between
factors of environment and aspects of pupils useful in guidance?

2. With a pupil you have known, illustrate the effect of undesirable
intra-family relationships on school work.

3. To what extent may two children in the same family be considered
to have similar and different family backgrounds? Illustrate with
aspects of family background you would consider similar and those
you would consider different.

4. How may recent community and world events have affected the
attitudes and adjustments of pupils?

5. In your own school experience have there been teachers who affected
pupil emotional adjustment adversely? Favorably? By what means?

6. Compare the community in which you grew up with the one in which
you are now living, even if they bear the same name. What differences
are there that would affect the various aspects of pupils considered
in Chapters II to VI?

7. Each individual is at least three individuals: as he sees himself, as
others see him, and as he thinks others see him. Work out in tabular
or graphic form all the possible interrelationships with respect to
parent, teacher, and child, taking the above threefold point of view
into consideration.


REFERENCES 


1. Baruch, Dorothy W., “A study of reported tension in the interparental
relationships as coexistent with behavior adjustment in
young children,” Journal of Experimental Education, 6 : 187-204
(1937).

2. Breemes, E. L., Remmers, H. H., and Morgan, C. L., “Changes in
liberalism-conservatism of college students since the depression,”
Journal of Social Psychology, 14 : 99-107 (1941).

3. Bruel, O., “A moving picture as a psychopathogenic factor: a paper
on primary psychotraumatic neurosis,” Character and Personality,
7 : 68-76 (1938).

4. Buck, W., “A measurement of changes in attitudes and interests of
university students over a ten-year period,” Journal of Abnormal and
Social Psychology, 31 : 12-19 (1936).

5. Burgess, E. W. (chairman), The Adolescent in the Family; a Study
of Personality Development in the Home Environment (Subcommittee
on the Function of Home Activities in the Education of the
Child, White House Conference on Child Health and Protection),
New York: D. Appleton-Century Company, Inc., 1934.




Burgum, Mildred, "Constructive values associated with rejection,"
American Journal of Orthopsychiatry, 10 : 312-326 (1940).

Chapin, F. S., "A quantitative scale for rating the home and social
environment of middle-class families in an urban community: A
first approximation to the measurement of socio-economic status,"
Journal of Educational Psychology, 19 : 99-111 (1928).

Collins, J. H., and Douglass, H. R., "The socio-economic status of
the home as a factor in success in the junior high school,"
Elementary School Journal, 38 : 107-113 (1937).

Cook, L. A., Community Backgrounds of Education, New York:
McGraw-Hill Book Company, Inc., 1938.

Curtis, E. A., and Nemzek, C. L., "The relation of certain unsettled
home conditions to the academic success of high school pupils,"
Journal of Social Psychology, 9 : 419-435 (1938).

Davis, Jerome, “Testing the social attitudes of children in the govern- 
ment schools of Russia," American Journal of Sociology, 32 : 4597 
471 (1927). 

DeBoer, J. J., "Radio and children's emotions,” School and Society, 
59 : 369-373 (1939). 

Holley, C. E., The Relationship Between Persistence in School and 
Home Conditions, National Society for the Study of Education 
Yearbook, Vol. 15, Part 2, 1916. 

Kerr, W. A., and Remmers, H. H., “The construction and validation 
of a group home environment scale,” Proceedings of the Indiana 
Academy of Science, 50 : 201-206 (1941). 

Kornhauser, A. W., “The economic standing of parents and the 
intelligence of their children,” Journal of Educational Psychology, 
9 : 159-164 (1918). 


Krout, M. H., "Typical behavior patterns in twenty-six ordinal
positions," Pedagogical Seminary and Journal of Genetic Psychology,
55 : 3-30 (1939).

Lynd, R. S., and Lynd, Helen M., Middletown in Transition, New York:
Harcourt, Brace & Company, Inc., 1937.

Meltzer, H., "Economic security and children's attitudes toward
parents," American Journal of Orthopsychiatry, 6 : 590-608 (1936).

Mowrer, Harriet, Personality Adjustment and Domestic Discord,
New York: American Book Company, 1935.

Peterson, T. D., "The relationship between certain attitudes of
parents and children," Studies in Higher Education XXXI, Further
Studies in Attitudes, Series II, Bulletin of Purdue University, Vol.
37, 1936.




Remmers, H. H., "Distinguished students—what they are and why," 
Studies in Higher Education XV, Bulletin of Purdue University, 
Vol. 31, 1930. 


Risen, N. L., "Relation of lack of one or both parents to school
progress," Elementary School Journal, 39 : 528-531 (1939).

Rundquist, E. A., and Sletto, R. F., Personality in the Depression,
Minneapolis: University of Minnesota Press, 1936.

Springer, N. N., "The influence of general social status on the
emotional stability of children," Pedagogical Seminary and Journal
of Genetic Psychology, 53 : 321-328 (1938).

Springer, N. N., "The influence of general social status on school
children's behavior," Journal of Educational Research, 32 : 583-591
(1939).

Stagner, R., The Psychology of Personality, New York:
McGraw-Hill Book Company, Inc., 1937.

Symonds, P. M., "A study of parental acceptance and rejection,"
American Journal of Orthopsychiatry, 8 : 679-688 (1938).

Symonds, P. M., The Psychology of Parent-child Relationships,
New York: D. Appleton-Century Company, Inc., 1939.

Thurstone, L. L., and Jenkins, R. L., "Birth order and intelligence,"
Journal of Educational Psychology, 20 : 641-651 (1929).

White House Conference on Child Health and Protection: The
Young Child in the Home, New York: D. Appleton-Century Company,
Inc., 1936.

Witmer, Helen L., and others, "The outcome of treatment of
children rejected by their mothers," Smith College Studies in Social
Work, 8 : 187-234 (1938).


PART TWO 


How to Evaluate 


INTRODUCTION 


WE HAVE NOW REACHED THE POINT IN OUR TREATMENT OF EVALUATION 
where we must be concerned with the techniques and methodologies 
with which to satisfy the purposes set up in Chapter I by evaluating 
the aspects of pupils discussed in Chapters II-VII. Our procedure in 
this Part will be to present a full treatment of all steps in the evalua- 
tion technique for each aspect of pupils, beginning with the pupil's 
achievement of instructional objectives and proceeding therefrom 
to each of the others in turn. Evaluation techniques have been more 
highly developed for some of these aspects than for others. Espe- 
cially in the field of achievement evaluation, there has been accumu- 
lated a large fund of practical and technical information. It is for 
this reason that our presentation of evaluation techniques begins 
with achievement of instructional objectives. When we have completed
our treatment of this aspect, such practical and theoretical
considerations will have been presented as will contribute substan- 
tially to an understanding of the basic concepts and considerations 
involved in all evaluation of any aspect of pupils whatsoever. 


The Five Major Steps of Evaluation


It is helpful to conceive of the evaluation process as consisting of
five major steps, which may be distinguished from one another on
the basis of their involving, in logical order, distinct sets of
operations. In one sense, the structure of this book represents a similar
analysis, in that we were concerned first with the why, then with 
the what, and now with the how of evaluation and measurement. 
The present list, however, presents a more detailed breakdown of 
this “how.” These five major steps are as follows: 

1. Statement of the Purpose and Content of Evaluation.—This
step in evaluation procedure has, of course, been treated in the
preceding chapters. It is mentioned here merely to serve as a liaison
with those chapters. In Part I we presented somewhat detailed
definitions, descriptions, and discussions of the major aspects of
pupils which should be evaluated. These aspects were in turn de- 
termined by our statement of the purposes of evaluation in Chap- 
ter I, where these purposes were centered around the function of 
providing data for guidance. Thus it is seen that the preceding 
chapters have been designed to help teachers carry out this first 
step in the evaluation process. 

2. Construction or Selection of an Evaluating Device.—This step
should proceed in close touch with Step 1, of course, since only the
devices that have been aimed at specific materials and purposes can
be expected to provide worth-while data on those materials and for
those purposes. The microscope, for example, can furnish data
concerning only very small organisms because it was designed for the
purpose of enabling visual examination of objects of certain sizes.
This obvious consideration has frequently been forgotten in the
construction of evaluation devices for those aspects of pupils with
which we are concerned. Instruments have been designed to evaluate
achievement without close and critical regard for what was meant
by achievement. Consequently, as was indicated in Chapter II,
many devices intended to evaluate achievement of instructional 
objectives have reflected a narrow conception of this factor. Failure 
to obey this stricture, that construction or selection of evaluation 
devices must be attuned to purposes and objectives, has thus fre- 
quently in the past debased pupil learning, teacher instruction, and 
the data upon which guidance was based. 

3. Administration of the Evaluating Device.—This step is easily
understood in terms of illustrations from everyday life. If our pur- 
pose is to measure the length of a room and we have selected or 
constructed a device for this purpose in the form of a foot rule, 
then we administer this device by placing it as many times as pos- 
sible end-for-end along the lengthwise dimension of the room. Simi- 
larly, we administer the device for evaluating a pupil’s temperature 
by placing a fever thermometer underneath his tongue. For each of 
the evaluating devices to be discussed, the method of administration 
will have to be specified either explicitly or by inference from the 
classification to which the device belongs. 

4. Interpretation of the Data Yielded by the Evaluating Device.— 
Any set of data acquire meaning only after they have been interpreted
and related to the purpose and content of evaluation. The
number of times the foot rule is placed along the side of the room 
is interpreted as the length of the room. The position of the column 
of mercury in the fever thermometer must be interpreted in terms 
of the pupil's temperature and health. The scores and other data 
derived from the evaluating devices to be described here must be in- 
terpreted similarly. 

5. Evaluation of the Evaluating Device.—After the device has 
been administered and the resulting data interpreted, the device 
itself may be evaluated in terms of the success with which it serves 
its purpose and of the degree to which it can be improved. The 
criteria by which evaluating devices should be judged—that is, the 
concepts of validity, reliability, and practicality—will be our major 
concern under this heading. It is at this stage of the entire process 
of evaluation, perhaps, that the techniques are most clearly in need 
of orientation with some logical formulation or philosophical basis. 
Here, as in the process of interpreting evaluative data, the discussion 
will be in terms of quantitative concepts and statistical techniques.


CHAPTER VIII 


Achievement Testing 


LET US ASSUME THAT THE OBJECTIVES OF INSTRUCTION HAVE ALREADY 
been formulated and stated in accordance with the rules presented in 
Chapter II. At this point, then, the teacher is equipped with a set of 
objectives stated in the form of grouped, unitary, understandable, 
and observable changes in pupils, toward which instructional effort 
has actually been directed, which involve both subject matter and 
mental processes, and which have been determined by community 
and individual needs. 

From this point onward, the construction or selection of the evalu- 
ating device requires that answers be given to the following ques- 
tions: 

1. At what objective or group of objectives is the evaluation device 
aimed? 

2. Which of the following three general types of devices is best
suited to evaluate the achievement of these objectives?

a. Devices involving language products, either verbal or
mathematical

b. Devices involving non-language products

c. Devices involving the direct observation of behavior or
performance. (Illustrations of these are furnished below.)

3. If a device involving language is chosen, should it be an essay
test or a short-answer test? *

4. If a short-answer test is chosen, should it be an externally made,
standardized, purchasable test or a teacher-made test?




*“Short-answer test” as used in this book refers to the whole class of objectively 
scored test items rather than only to the recall type in which a single word, phrase, 
or sentence is the pupil's response. 




5. If a teacher-made short-answer test is chosen, what types of
questions should be used and how should they be apportioned and
composed?

6. If an externally-made, standardized, short-answer test is used,
how shall it be chosen from among the many available?

7. If a non-language product or behavior device is used, how shall
it be constructed?

8. If an essay test is used, how shall it be constructed?

In this chapter we shall consider Questions 1 to 4. It will be noted
that in these questions we have selected the answers for Questions
1 to 5 so as to narrow our problem to the most immediate and
practical concern of most classroom teachers. That is, most teachers use
devices involving language in the form of short-answer tests which
they make themselves, as was indicated in a study by Lee and
Segel (11 : 6-12) of the testing practices of about 1500 high school
teachers.

1. At what objective or group of objectives is the evaluation de- 
vice aimed? Different objectives require different evaluation devices. 
Frequently, not all of the objectives of a class of instruction can be 
approached by even the same type of device. Consequently, the first 
step in constructing an evaluation device after the objectives have 
been formulated is to single out the objective, or objectives, at which 
each device is to be aimed. This procedure of singling out an ob- 
jective so that evaluation devices may be constructed to fit its peculiar 
requirements has been well illustrated by Tyler (20 : 8): 




Each of the eight objectives set up for elementary courses in zoology 
was defined in terms of the behavior expected of students. In defining the 
first objective, a fund of information about animal activities and struc- 
tures, the specific facts and general principles which the student should 
be able to recall without reference to textbooks, or other sources of 
information, were indicated. The second objective, an understanding of 
technical terminology, was defined by listing the terms which the 
student himself should be able to use in his own reports, and another 
list of terms which he would not be expected to use, but should be able 
to understand when he finds them in zoological publications. The third 
objective, an ability to draw inferences from facts, that is, to propose 
hypotheses, was defined by describing the types of experiments which 
an elementary student should be able to interpret. The fourth objective,
ability to propose ways of testing hypotheses, was defined by listing the
types of hypotheses which an elementary student should be able to 
validate by experiment, or to propose ways of validation. The fifth ob- 
jective, an ability to apply principles to concrete situations, was defined 
by listing the principles which elementary students should be able to 
apply, and types of concrete situations in which the student might apply 
these principles. The sixth objective, accuracy of observation, was de- 
fined by listing the types of experiments in which the elementary student 
should be able to make accurate observations. The seventh objective, skill 
in the use of the microscope and other essential tools, was defined by 
describing the types of microscopic mounts and types of dissections which 
elementary students should learn to make. The eighth objective, an 
ability to express effectively ideas related to zoology, was defined by 
indicating the nature of the reports, both written and oral, which
zoology students are expected to make and the qualities demanded for
these reports to be effective.


2. Which of the following three general types of devices is best 
suited to evaluate the achievement of these objectives? 
a. Devices involving language, either verbal or mathematical 
b. Devices involving non-language products 
c. Devices involving direct observation of performance

These types may be illustrated by reference to the objectives listed
in the quotation from Tyler. Devices involving either verbal or 
mathematical language (here language is conceived as any system of 
communication involving symbols) would apply to the first, second, 
third, fourth, fifth, and eighth objectives. Devices involving non- 
language products would be applicable to the seventh objective, skill 
in use of the microscope and other essential tools, since this skill 
would be reflected in the quality of the microscopic mounts and 
types of dissection made by the students. Devices involving direct
observation of behavior would similarly be applicable to the sixth
and seventh objectives, since the evaluator directly observes the
student's behavior, the movements of his body and eyes, the position
and vantage points adopted, the skill in manipulating instruments.
The first type of device, that involving verbal or mathematical
language, is represented by the "paper and pencil" test, which is by far
the most frequently used type of evaluation device. The preponderance
of written language or paper and pencil devices in the evaluation
of achievement is merely a reflection of the generality of
language in the school curriculum and in our civilization. 

Devices involving non-language products are usually in the form 
of a check list or rating scale upon which may be indicated the 
presence or absence, or degree of quality, of the salient features of the 
product. Typical products are those which pupils turn out in art 
courses, chemical laboratories, home economics laboratories, wood- 
work and metalwork shops, and in many other kinds of school work. 
Thus under the heading of non-language products may be listed such 
objects as drawings, precipitates, pies, dresses, chairs, bookends, and 
wrenches. Devices involving behavior are those which enable an 
observer using his eyes and ears to evaluate on a rating scale or 
check list a pupil's achievement of desirable ways of moving or 
speaking. Typical observations under this heading would be clarity 
of speech, posture, habits of recitation in class, physical education
achievement, and speed and efficiency in handling scientific apparatus 
or laboratory equipment. 

Obviously, in the majority of cases, teachers will find devices in- 
volving language most suited to their instructional objectives. 

3. If a device involving language is chosen, should it be an essay 
test or a short-answer test? Although the distinction between these 
two types of tests cannot have escaped anyone who has gone to an 
American school, it is presented to furnish the basis of our discussion
of considerations in choosing between them. An essay test is one in
which the pupil's response is in the form of a complete sentence or
series of sentences. A short-answer test is one in which the response
is a single word or phrase, number or mark. An illustration of an
essay test is the following: "Discuss the causes of the American
Revolution" or "In which, and for what reasons, can you do better,
oral or written examinations?" An example of a short-answer test is
the following item: "America was discovered by (1) Leif Ericson,
(2) Vasco da Gama, (3) Columbus, (4) Ponce de Leon."

Let us now compare the merits and limitations of essay tests and 
short-answer tests in the light of the following considerations: 

a. Reliability* of grading or scoring
b. Extensiveness of the sampling of achievement
c. Possibility of the pupil's guessing or bluffing
d. Pre-test, or motivational, effects on pupil achievement
e. Post-test, or instructional, effects on pupil achievement
f. Labor required in construction
g. Labor required in scoring or grading
h. Cost of administering
i. Attitudes of pupils toward each type of test
j. Intellectual pleasure and growth derived by teacher from
constructing and scoring
k. Distorting effects on achievement of the medium of expression
l. Fitness for evaluation of complexly organized achievement

* Reliability may be roughly defined for the present as the consistency with which
a device yields the same results. If the device does not yield consistent measurements,
it is unreliable. A rubber band, for example, would be an unreliable device for
measuring the circumference of a cylinder.

a. Reliability of Grading or Scoring.—In general, the short-answer
test can be graded more reliably, or consistently, than the essay test.
This means that a given short-answer test paper will be given the
same score no matter who grades it or on how many different
occasions it is graded by the same person, if we disregard clerical
errors. On the other hand, essay tests will not be graded in the
same way either by different people or by the same person at
different times. The unreliability of rating essay tests has been
demonstrated repeatedly by experiments extending in time from those of
Starch and Elliott (18) in 1912, to those of Hartog (7) in 1935. All
of these investigators have found that scores on essay tests graded
according to the usual methods by different graders vary widely.
Brown (2 : 33) reports on "an experiment, conducted with home
economics seniors who had completed their supervised teaching
and graduate students who were experienced teachers, [which]
showed that essay-type tests in home economics were as difficult to
score accurately as were tests in other subjects. The range of scores
assigned by these judges to seven ninth-grade papers is shown in
Figure 2." (Our Fig. 3.)

[Figure 3. Horizontal axis: Percentage Mark, 0 to 100.]
Fig. 3.—Range of percentage marks assigned by 12 home economics
teachers to 7 essay-type examination papers of 9th-grade students.
(After 2 : 17.)

Among the factors which may be responsible for this unreliability 
in grading essay tests are: 

(1) Different standards of excellence both among different 
teachers and on the part of single teachers from one occa- 
sion to the next. 

(2) Psychological factors, such as fatigue, affecting ability to 
distinguish between closely allied degrees of merit. As a 
test grader proceeds through a large pile of papers there 
are systematic changes, resulting from factors which, while 
varying for different graders, grossly affect the test scores. 

(3) The influence of handwriting; for example, James (8) and 
Sheppard (17) both found a positive relationship between 
the quality of handwriting and the grade accorded to test 
papers identical in content. 

In any case, although it will be shown later that certain techniques 
can materially reduce the unreliability of scoring essay tests, the 
present conclusion must be that short-answer tests are far superior 
on this count. 
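
The spread of marks that different graders give to one and the same paper can be tabulated very simply. The following is only a minimal sketch in Python: the function name `mark_spreads` and all of the marks are hypothetical illustrations, not the data reported by Brown or shown in Fig. 3.

```python
def mark_spreads(marks_by_paper):
    """Return {paper: (lowest, highest, spread)} for the marks several graders gave."""
    spreads = {}
    for paper, marks in marks_by_paper.items():
        lowest, highest = min(marks), max(marks)
        spreads[paper] = (lowest, highest, highest - lowest)
    return spreads


# Hypothetical percentage marks from four graders (NOT the data behind Fig. 3).
essay_marks = {"paper A": [55, 70, 82, 64], "paper B": [40, 75, 60, 58]}

# With an objective scoring key every grader reaches the same score,
# clerical errors aside, so the spread is zero.
short_answer_marks = {"paper A": [72, 72, 72, 72], "paper B": [61, 61, 61, 61]}

for label, marks in (("essay", essay_marks), ("short-answer", short_answer_marks)):
    for paper, (lowest, highest, spread) in mark_spreads(marks).items():
        print(f"{label:12s} {paper}: {lowest}-{highest} (spread {spread})")
```

A wide spread for a single paper is the kind of disagreement among graders that the studies cited above describe; a spread of zero is what consistent scoring looks like in the same table.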

b. Extensiveness of the Sampling of Achievement.—The usual 
short-answer test containing at least fifty items or questions draws 
upon a far wider range of pupil achievement than does the usual
essay test containing fewer than fifteen questions. Consequently,
short-answer tests run less danger of permitting chance variations in
student preparation and achievement to have a great effect on test
scores. Such effects occur when pupils do or do not happen to guess
correctly which one or more of a small number of essay questions 
would be asked. Furthermore, the extensive coverage of subject 
matter and mental processes enabled by short-answer tests tends 
to produce less variation among teachers in the content selected for 
tests. To illustrate, if we conceive of a subject-matter area consisting 
of "two hundred units of knowledge," there will be less variation 
among teachers when each selects one hundred of these units for 
a test, than when each teacher selects only ten. (Great variation, or 
subjectivity in selection of test content, will remain, as will be noted 
later.) Here again, short-answer tests are superior to essay tests. 
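
The "two hundred units" illustration can be given a rough arithmetical form. The sketch below assumes, purely for illustration, that each teacher picks units independently and at random, which real teachers do not do. Under that assumption the expected number of units common to two tests of n units each, drawn from N units, is n × n / N. The function names are hypothetical.

```python
from fractions import Fraction


def expected_shared_units(total_units, units_per_test):
    """Expected number of units common to two tests drawn independently at random."""
    return Fraction(units_per_test * units_per_test, total_units)


def expected_shared_fraction(total_units, units_per_test):
    """Expected share of one teacher's test that also appears on the other's."""
    return expected_shared_units(total_units, units_per_test) / units_per_test


for size in (100, 10):
    share = expected_shared_fraction(200, size)
    print(f"{size} of 200 units: about {float(share):.0%} of the content is shared")
```

On this simplified model two hundred-unit tests would be expected to share half their content, while two ten-unit tests would share about one-twentieth, which agrees with the direction of the argument in the text.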

c. Possibility of the Pupil's Guessing or Bluffing.—The charge 
is often leveled at short-answer tests that the restricted number of 
alternatives which they present enables a pupil to achieve a higher 
score than is warranted by his true achievement. That is, in a test of 
one hundred items each presenting two choices, pupils could on the 
average make fifty correct responses by following the advice of a 
tossed coin. For completion short-answer test items the guessing 
factor is obviously negligible. Various statistical formulae have been 
offered to correct for this chance factor. However, the problem of 
guessing cannot be entirely eliminated by such mechanically applied
statistical corrections, since they are based either on theorems of
probability or on statistical studies of average effects, neither of
which can determine and correct for the individual
pupil's spurious achievement through guessing.

The essay test, however, may be similarly thwarted if it is not
constructed and graded with technical expertness. A pupil who knows
scarcely anything about a subject may often take advantage of the 
positive effect on grades of such factors as speed of writing, literary 
style, and ability to "free-associate" around a question. Here, how- 
ever, the effect of bluffing and guessing can be counteracted in the 
case of individual students by the exercise of valid, constant scoring 
standards. 
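
The coin-tossing arithmetic for short-answer tests mentioned under this heading can be sketched as follows. The text does not say which correction formulae it has in mind; the one shown here, rights minus wrongs divided by one less than the number of choices, is the conventional correction for guessing and is offered only as an assumed example. The function names are hypothetical.

```python
import math


def expected_chance_score(num_items, choices_per_item):
    """Average number right when every item is answered by blind guessing."""
    return num_items / choices_per_item


def chance_score_std(num_items, choices_per_item):
    """Standard deviation of the blind-guessing score (binomial model)."""
    p = 1 / choices_per_item
    return math.sqrt(num_items * p * (1 - p))


def corrected_score(rights, wrongs, choices_per_item):
    """Conventional correction for guessing: R - W / (k - 1); omitted items are ignored."""
    return rights - wrongs / (choices_per_item - 1)


print(expected_chance_score(100, 2))   # average score from coin tossing
print(chance_score_std(100, 2))        # how far single pupils may stray from it
print(corrected_score(50, 50, 2))      # a pure guesser, on the average
```

The standard deviation shows why the correction works only on the average: an individual guesser may land several items above or below fifty, and the formula cannot tell which pupils were lucky.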

d. Pre-test, or Motivational, Effects on Pupil Achievement.—
Does expectation of a short-answer test have a different effect upon 
pupil preparation and achievement than expectation of an essay test?
Several experiments have attempted to answer this question.
Terry (19) found that two classes in educational psychology prepared
for an objective test by emphasizing details, whereas expectation
of an essay test led them to emphasize larger units of subject
matter. Similarly, Douglass and Tallmadge (4) found that the 
short-answer test directs attention to detail and exact wording, 
while the essay test encourages study methods involving organiza- 
tion, perception of relationships and trends, and personal reactions. 
Meyer (13) studied the effect of essay and short-answer test ex- 
pectation on both preparation and achievement. Dividing 124 stu- 
dents of Civil War history into four groups he directed one group 
to study for a true-false test, the second group for a multiple-choice 
test, the third group for a completion test, and the fourth group for 
an essay test. At the end of the study period all groups were given 
all four types of tests. He found that all four types of tests were
equally good for the measurement of recalled facts, but that the 
groups which had studied for completion and essay tests achieved 
higher scores when the students were tested again five weeks 
later. Consequently he concluded that achievement was more perma- 
nent when the student anticipated an essay type of test. Also the 
methods of study were superior, involving more worth-while units 
of subject matter and less attention to rote learning, when the
students studied for essay tests. Not only general achievement and
methods of study were affected, but also the type of material learned.
Students expecting short-answer tests intended to (1) learn facts,
(2) memorize statements, (3) put emphasis on details, and (4) learn
definitions, words, and figures, whereas pupils expecting essay tests
attempted to (1) get a general view of the material, (2) form
personal opinions, (3) interpret material, and (4) fix the general
outline and then add the details. Meyer's general conclusion was that
the individual's examination set, that is, expectation, is of
fundamental importance in learning and retaining sense material and that
the essay examination set should be used in preference to any 
objective examination set if the student is to be able to recall material 
in an organized fashion and to know facts when cues are not given. 
Certain precautions should be taken in the interpretation of 
Meyer's results. They are based on college students rather than
secondary and elementary school pupils. Also, it is unknown
whether the same results would be obtained in other subject-matter
areas or by other investigators. Pending further investigation, our
present conclusion must be that essay tests probably motivate better
preparation, achievement, and retention than do short-answer tests.

e. Post-test, or Instructional, Effects on Pupil Achievement.—
This consideration may be broken down into (1) the suggestive 
effect of the untrue material included in short-answer tests and 
(2) the potentialities of the two kinds of tests for use in increasing 
various kinds of achievement. Negative suggestion refers to the 
possibility that the false statements in true-false tests and the
incorrect alternatives in multiple-choice tests may implant
misinformation in the minds of the pupils. That is, students may retain the
false elements of tests and later on believe them to be correct. How- 
ever, research (16 : 358-365) has shown that in addition to the weak- 
ness of the psychological basis for such a possibility in disregarding 
mental set, the small amount of negative suggestion is fully offset 
by an even greater positive suggestion, leaving a positive net effect.
Thus, on the whole, pupils learn more than they lose by taking
true-false tests.

The value of the two types of tests for instructional purposes can 
take two forms. One such value for the essay test is urged
forcefully by Krey (10):


There is one skill which may, indeed has been, seriously impaired by
the excessive use of the new-type tests. This is skill in expression, the 
ability to set forth some topic in social science clearly, convincingly, and 
agreeably. Teachers in college have begun to remark that students who 
come to them from school systems in which the new-type test has been 
used almost exclusively for a number of years are unable to express 
themselves cogently either orally or in writing. In school systems in 
which the new-type test has been used extremely, it has been possible 
for students to avoid writing a single complete sentence except in courses 
in English composition. Inasmuch as coherent and cogent composition is 
still one of the most widely used, as it is one of the most valuable, 
skills in social science, it would seem essential to continue to use the 
essay-type examination, if for no other reason than to afford practice 
in this skill. 


Although Krey applies his argument only in the social sciences, 
his point of view has been upheld in many other fields of subject 
matter. Granted the importance of the skills he mentioned, the essay
test becomes essential as a motivator and instructional device for
the development of those skills and acquires an indispensable role 
in educational procedures. It should be pointed out, however, that 
ability to write "coherent and cogent composition" is a very im- 
portant instructional objective and its achievement should by no 
means be left to the incidental learning which may accrue while 
taking an examination—usually under pressure of the time with 
logical organization not the focus of the pupil’s effort. On the 
contrary, it should be carefully provided for in the instructional 
program and separately evaluated. 

Short-answer tests may also be of instructional value, however, 
as shown by a study by Curtis and Woods (3), who compared the 
teaching values of four common practices in correcting examination 
papers. The following four methods were compared: 

(1) Pupils corrected own papers while the teachers read the 
correct answers. Free discussion followed. 

(2) Teacher checked incorrect items but made no corrections. 
Papers were later returned and discussed item by item. 

(3) Teacher carefully wrote in all corrections. Papers were later 
returned and discussed item by item.

(4) Teacher carefully wrote in all corrections. Papers were later 
returned but only the questions pupils asked about were 
discussed. 

On the next day and also after six weeks, the test containing one 
hundred items in various short-answer forms was repeated without 
warning. The first method, in which the teacher is least active and 
the pupils most active, resulted in the greatest improvement in 
scores. The other three methods, involving greater degrees of 
teacher activity and pupil inactivity, yielded progressively poorer 
results. We may conclude that short-answer tests can be given
instructional value when they are used in such a way as to provide
the practice necessary for increasing achievement. Thus it is seen
that both the essay test and the short-answer test can have distinct
positive effects on achievement, each in its special way. The
instructional value of the essay test depends upon the importance attached
to the ability to write essays; the instructional value of the
short-answer test can be aimed at whatever objectives it is possible to test
by means of it.




f. Labor Required in Construction.—The essay test in its usual 
form requires far less skill, time, and general effort than does the 
short-answer test. This follows from the nature of the two types 
of tests and from the differences between them in the number and 
precision of the questions put to the pupil. As will be seen in the 
following chapter, good short-answer test items can be composed 
only by dint of thoughtful, detailed, and intensive adherence to 
instructional objectives, and the application of much general intel- 
ligence. Essay test questions are fewer in number, usually less specific 
in content, and generally not as carefully composed with regard to 
subtle psychological values. However, as will be seen immediately 
below, the amount of effort required by the two types becomes more 
nearly equal when we come to scoring or grading the tests. Further- 
more, it is possible to accumulate short-answer test questions from 
year to year; they merely require editing to keep them up to date. 
Such accumulation increases (1) the sampling of the subject matter, 
(2) the flexibility available for constructing tests for special purposes, 
(3) the possibility of constructing “equivalent” forms of tests for 
determining reliability. Finally, the accumulation of questions even- 
tually decreases to a small amount the effort required for construc- 
tion of short-answer tests. 

g. Labor Required in Scoring or Grading.—Here the short-
answer test is superior because it may be scored quickly and with 
little skill, whereas essay tests must be graded by subject-matter 
experts, slowly and laboriously. Thus while an essay test 
may be constructed in, say, less than one hour it may require perhaps 
ten hours for grading, whereas a short-answer test may require ten 
hours for construction and perhaps one hour for scoring. However, 
the essay test requires expert knowledge for both construction and 
grading, while a short-answer test requires an expert only for its 
construction. Hence, although the illustrative time figures just 
given would probably never fit a real case, it is clear that there is 
little difference between the two types in the total quantity and 
quality of the work involved, unless the number of pupils tested 
becomes far greater than is usual in the average classroom. In this 
case, the short-answer test acquires great advantage from its greater 
scorability. 
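
The illustrative figures above can be turned into a small calculation. What follows is only a minimal sketch in Python. It assumes, although the text does not say so, that the ten hours of essay grading and the one hour of short-answer scoring refer to a class of 30 pupils, and it takes the "less than one hour" of essay construction as a full hour. The function name and the per-pupil rates are hypothetical.

```python
# Rough labor model: total minutes = construction time + per-pupil scoring time.
# Assumption (not from the text): the illustrative figures describe a class of
# 30 pupils, so 10 hours of essay grading is 20 minutes per pupil and 1 hour
# of short-answer scoring is 2 minutes per pupil.

def total_minutes(construction_minutes, scoring_minutes_per_pupil, pupils):
    """Return the total teacher time, in minutes, for one test."""
    return construction_minutes + scoring_minutes_per_pupil * pupils


ESSAY = {"construction_minutes": 60, "scoring_minutes_per_pupil": 20}
SHORT_ANSWER = {"construction_minutes": 600, "scoring_minutes_per_pupil": 2}

for pupils in (30, 150):
    essay = total_minutes(pupils=pupils, **ESSAY)
    short = total_minutes(pupils=pupils, **SHORT_ANSWER)
    print(pupils, "pupils:", essay, "min (essay) vs", short, "min (short-answer)")
```

Under these assumed rates the two types cost the same for a class of 30 (660 minutes each), while for 150 pupils the essay test takes 3060 minutes against 900 for the short-answer test. This is the pattern the paragraph describes: the difference appears only when the number of pupils grows well beyond the usual classroom.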




h. The Cost of Administering.—Short-answer tests usually are 
reproduced by mimeograph or some other method, so that each pupil 
may have a copy. Essay tests, on the other hand, can escape the 
cost of reproduction by being written on the blackboard or dictated 
orally to the pupils. The importance of these considerations depends 
upon the availability of reproducing devices to the classroom teacher. 
Since most present-day schools have a mimeograph, hectograph, or 
other means of reproducing test material, this consideration should 
probably not be allowed to have any considerable effect upon the 
teacher's choice between these two types of tests.

Furthermore, as is shown in Chapter XX, it is frequently possible 
to obviate the necessity of reproducing teacher-made short-answer 
tests by administering them orally or by using separate answer sheets 
which enable test booklets to serve more than one group of pupils.

i. Attitudes of Pupils Toward the Two Types of Tests.—The 
evidence on this question has differed. Some investigators report 
that pupils prefer essay tests while others report preferences for the 
short-answer test. Odell (14:193) claims that pupils, realizing that 
marks on essay tests are relatively unreliable, consider them unjust 
and prefer short-answer tests for the objectivity and reliability of 
scoring which they insure. However, Jones (9) found that the 
statement, “I think one’s ability is far better shown through dis-
cussion questions than through short objective questions,” was 
agreed to by 68 per cent of the students in colleges which gave senior 
comprehensives, and 55 per cent of the superior students in other 
colleges. Alumni taking both types of examinations offered even 
more favorable comments on the essay test, because they felt that 
it was more important to be able to discuss an issue than merely to 
check it. Similarly, Hanford (6) concluded from a survey of the 
examination system of Harvard College that undergraduates favor 
the “reasoning,” speculative type of examination.

Ruch (16 : 132-137) concluded from his summary of the studies 
on this question that essay tests and short-answer tests were both 
regarded as unpleasant tasks by both teachers and pupils, with about 
equal proportions of students favoring either type. It is almost 
unanimously agreed, however, that a test can gain the pupils' 
affections if they become convinced that it yields a benefit to 
them. This is illustrated by Boucher (1: 115-116), who has de-
scribed how students requested additional examinations in order to 
improve their knowledge of their own progress. Final examinations 
were given solely to satisfy their expressed needs for self-evalua-
tion, the scores on the tests being unrecorded and having no effect 
on their grades. Similarly, either type of test may be favored by 
students.

j. Intellectual Pleasure and Professional Growth Derived by the 
Teacher.—Since most of the work in short-answer testing goes into 
the construction of the test while most of the work in essay testing 
goes into grading the test, how do the two types compare in the 
pleasure of the work involved? Vernon’s preference (21:247) is 
one with which most teachers will agree: ". . . The setting of a 
new-type test is a fascinating occupation, which can be done in odd 
moments throughout the year; and the marking is simply a routine 
matter which involves no mental strain. By contrast, the marking 
of large numbers of essay-type scripts in psychology is the most try- 
ing work that he has ever had to do." (It has, indeed, recently be- 
come possible to have a machine do the scoring of short-answer 
tests.) 

Constructing short-answer tests probably results in greater pro- 
fessional growth for the teacher than does constructing essay tests. 
The need for thinking through instructional objectives, for intensive 
attention to subject matter, for sophistication in the logical theory of 
measurement, for insight into pupil difficulties and errors is greater 
in short-answer testing. In satisfying these needs the teacher should 
acquire desirable skills that will carry over into all phases of teaching. 

k. Distorting Effect on Manifested Achievement of the Medium 
of Expression.—Here we are concerned with whether essay tests 
and short-answer tests introduce such factors between evaluator 
(that is, the teacher) and achievement as to distort the expression 
of the latter, in the same way that colored glass changes the things 
seen by anyone who looks through it. Do essay tests involve behavior 
on the part of the pupil that colors his achievement? Similarly, do 
short-answer tests elicit behavior that reflects not only achievement 
but also intelligence and test sophistication operating to provide 
him with correct answers even when he could not really claim to 
know the correct response? Essay tests measure ability in expression, 
breadth of vocabulary, speed of writing, and other aspects of pupils 
which may be considered irrelevant to the evaluation of a specific 
kind of achievement. The “halo” effect, the teacher's general im- 
pressions, likes and dislikes of a pupil, also come between the pupil’s 
achievement and the teacher's evaluation of it. 

Short-answer tests, however, are not necessarily free from this 
distortion. A pupil must be able to follow the instructions given in 
a test and then manipulate his knowledge and intellectual resources 
in such a way as to fit them to the requirements of the question. 
Lindquist (12) has presented an excellent summary of the “ir-
relevant cues” which may operate so as to twist the achievement of 
a pupil below or above its true value. He emphasizes the need for 
skill in anticipating all the mental devices which, although not a 
part of instructional objectives, enable a pupil to obtain a higher 
score on a test than his achievement of instructional objectives 
warrants. The tricks of test question phrasing that will reduce this 
distortion to a minimum will be discussed in Chapter IX. Here it 
is our aim only to note that short-answer tests as well as essay tests 
can exert an influence upon what they are testing so as to distort it. 

l. Fitness for Evaluation of Complexly Organized Achievement.— 
It is frequently charged that short-answer tests measure only small 
elements of achievement in the form of details of information and 
collections of facts, and that essay tests are necessary to the evalua- 
tion of general understanding, interpretation, capacity for organiz- 
ing and formulating knowledge, and other complex mental opera- 
tions. This belief is reflected in the findings concerning differences 
in the way pupils prepare for the two kinds of tests. Short-answer 
tests can, however, be designed in such a way as to get at these 
complexly organized types of achievement. The Evaluation Staff in 
the Eight-Year Study of the Progressive Education Association 
(15), recognizing the importance of such objectives of instruction 
and the necessity for evaluating achievement of them, has con- 
structed short-answer tests with such titles as Interpretation of Data, 
Application of Principles of Thinking (several subject fields), Prob- 
lems Relating to Proof in Mathematics, Critical-mindedness in the 
Reading of Fiction, and others. The claim to measure complexly 
organized achievements which these titles reflect indicates at 
least that the essay test has not gone unchallenged as the fittest device 
for this purpose. It is possible to say here, without subjecting these 
new types of short-answer test questions to a critical discussion, that 
they do succeed in measuring some types of mental processes that 
are far more complex than mere rote memory, factual knowledge, 
unrelated bits of information, and other so-called "elementaristic" 
types of achievement. The one type of achievement which short- 
answer tests are incapable of evaluating directly is the ability to 
write essays; in this field the essay test will always remain the most 
convincing, even if not the most reliable, method of evaluation. 

Summary.—After weighing the relative merits of essay tests and 
short-answer tests in the light of these twelve considerations, we 
may conclude that the essay test may possess some advantage as a 
motivational device, as an instructional device for increasing ability 
to write essays, and as a device for getting at that ability directly. 
On all other counts the short-answer test has been found either 
equal or superior to the essay test. The practical implications of 
these conclusions for the classroom teacher are: 


1. Use short-answer tests, in all the varied forms to be described 
in Chapter IX, to evaluate all types of achievement except 
the ability to write essays. 

2. Use the essay test as one way to evaluate ability to write 
essays. 

3. Use the essay test to motivate pupils to learn certain materials 
(selected carefully for their worth-whileness) in such a way 
that they will be able to recall the materials in an organized 
fashion and to know facts when cues are not given. For this 
purpose the essay test should be used in the improved forms 
to be described in Chapter XII. 


4. If a short-answer test is chosen, should it be an externally 
made, standardized, purchasable test or a teacher-made test? 
Illustrative of standardized tests are the New Stanford Achievement 
Tests in the elementary school subjects and the tests of the many 
cooperative state high school testing services. These, of course, are 
probably not typical because they are among the best of available 
standardized tests, far superior to the general run. They serve 
merely to define by illustration what we mean by standardized 
tests.




Let us now compare the merits and limitations of standardized 
tests and teacher-made tests in the light of the following considera-
tions:

a. Closeness of fit to instructional objectives

b. Refinement of construction

c. Interpretations possible with each type of test

d. Expenditures and gains of the teacher

a. Closeness of Fit to Instructional Objectives.—Whether instruc-
tional objectives refer only to subject matter or to both subject 
matter and mental processes, the teacher-made short-answer test 
can usually, if not always, achieve a closer fit to the instructional 
objectives of a particular classroom, school, or school system than 
can a standard test. This is so by the very nature of the two types 
of tests, since the standardized test must strive for content which 
is common to a great many schools while the content of the teacher-
made test can fit not only objectives which a particular teacher has 
in common with other teachers but also those which are peculiar 
to his own conception of his instructional objectives as they reflect 
the abilities and needs of his pupils and his community. As a result, 
standardized tests may evaluate the pupils’ achievement of ob-
jectives with which they have not been concerned, while neglecting 
other objectives that have constituted a major part of their school 
work.

This deficiency of standard tests will vary in different subjects, 
of course, in proportion to the degree of variability among class-
rooms, schools, and systems in instructional objectives, the relative 
emphasis upon these objectives, and the division of subject matter 
between semesters. In the tool subjects, such as reading, writing, and 
arithmetic, it is quite likely that the differences between classrooms, 
teachers, and school systems are not so great as those in the social 
studies and the natural sciences. In any case, the teacher who uses 
standard tests should be prepared to face the charge of “testing 
what hasn't been taught.” It is evident that according to this con-
sideration the teacher-made test will almost always be found 
superior.

b. Refinement of Construction.—Under this heading we may 
consider the selection and expression of test content, the procedure 
for administering the test, and the means furnished for interpreting 
the scores. The content of standard tests usually is selected by 
groups of subject-matter experts operating in close contact with the 
most respected textbooks and courses of study, statements of ob- 
jectives, teaching methods, and expressions of the philosophy of a 
subject. Furthermore, each item of the content and its expression 
usually have been subjected to the criticism of many other experts 
and tried out on pupils; from these preliminary tryouts have been 
computed statistical measures of the difficulty and validity³ of the 
item. 

On the basis of these statistical measures poor items are weeded 
out and the remaining ones kept, usually arranged in the order of 
difficulty and divided into two or more groups to give equivalent 
forms of the same test. Standardization of administrative procedure 
means that definite directions have been worked out, with pro- 
visions for practice exercises, time limits, oral directions to pupils, 
all designed to result in the best possible testing conditions being 
held uniform from classroom to classroom. Standardization of in- 
terpretation means that scoring directions, question weightings, 
and corrections for guessing have been definitely worked out as 
an inseparable part of the test itself. 

It is possible for classroom teachers to standardize their own tests 
in all of these respects by acquiring and applying the intensive and 
technical statistical techniques required. But it is extremely im- 
probable that most teachers will have the resources necessary for 
extensive preliminary tryouts of tests and for statistical treatment of 
the results. (In Chapter XXI we shall discuss the methods by 
which tests may be refined and improved subsequent to their first 
administration. Sufficient evaluation and improvement of tests by 
such means, even if performed by a single teacher, would result 
in a standardized test in the present sense.) The technical and 
time-consuming nature of the standardization process precludes the 
possibility that many teachers will be able to produce evaluation 
devices as refined as the best standard tests, because no matter what 
technical equipment, experience, and ingenuity a teacher can bring 
to bear on this process, only seldom will he have the time. In terms 
of “refinement,” then, standard tests must almost by definition be 
superior to teacher-made tests.

³ Item validity may be roughly defined here as the success with which an item 
discriminates between good pupils and poor pupils, in terms of their achievement of 
instructional objectives.
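
The two item statistics mentioned in this section, difficulty and validity, can be illustrated with a short Python sketch. It is offered under stated assumptions and not as the procedure used by the makers of any standard test: difficulty is expressed as the proportion of pupils answering correctly, and validity is approximated by comparing the upper and lower halves of the group ranked by total score. The function names and the tryout data are hypothetical.

```python
# Item analysis sketch. Each row is one pupil; each entry is 1 (right) or 0 (wrong).
# The upper-lower comparison is an assumed technique, chosen for its simplicity.

def item_difficulty(responses, item):
    """Proportion of all pupils answering the item correctly."""
    return sum(row[item] for row in responses) / len(responses)


def item_discrimination(responses, item):
    """Proportion correct in the upper half minus that in the lower half."""
    ranked = sorted(responses, key=sum, reverse=True)  # best total score first
    half = len(ranked) // 2
    upper, lower = ranked[:half], ranked[-half:]
    p_upper = sum(row[item] for row in upper) / half
    p_lower = sum(row[item] for row in lower) / half
    return p_upper - p_lower


# Hypothetical tryout data: four pupils, four items.
tryout = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
]

for item in range(4):
    print(item, item_difficulty(tryout, item), item_discrimination(tryout, item))
```

In this made-up tryout the last item is answered correctly only by a pupil in the lower half, so its discrimination is negative (-0.5). An item of that kind is a candidate for the weeding out described above.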

c. Interpretations Possible with Each Type of Test.—Standard 
tests are supplied with norms which enable comparisons with other 
groups of pupils. Thus a teacher may compare the achievement 
of his class with that of pupils throughout the nation; the standardi- 
zation group of pupils may be divided according to age, sex, or 
grade and made representative either of the nation at large or of 
urban groups, rural groups, etc. The meaningfulness of compari-
sons of classroom achievement or of individual pupil achievement 
depends almost entirely on the degree to which the sample used in 
the standardization process succeeds in representing the specific 
group which it claims to represent. That is, if a set of norms for 
twelve-year-old arithmetic achievement is based on a small group 
of pupils, or on a group which is not representative with respect to 
certain factors affecting arithmetic achievement such as intelligence, 
socio-economic status, or sex, then the comparisons will be mis-
leading and the teacher will be unjustified in interpreting a given 
pupil's achievement as above or below that of the average twelve- 
year-old. 
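
The dependence of such comparisons on the norm group can be shown with a small Python sketch. The text does not prescribe a particular statistic; the percentile rank used here is one common way of expressing standing relative to norms and is an assumption, as are the function name and both sets of scores.

```python
# Percentile rank of one pupil's score within a norm group.
# Both norm groups below are invented for illustration only.

def percentile_rank(score, norm_scores):
    """Per cent of the norm group scoring below, counting ties as half."""
    below = sum(1 for s in norm_scores if s < score)
    equal = sum(1 for s in norm_scores if s == score)
    return 100 * (below + 0.5 * equal) / len(norm_scores)


representative_group = [10, 12, 14, 14, 16, 18, 20, 22, 24, 30]
unrepresentative_group = [6, 8, 10, 10, 12, 12, 14, 16, 18, 20]

pupil_score = 16
print(percentile_rank(pupil_score, representative_group))
print(percentile_rank(pupil_score, unrepresentative_group))
```

With these invented figures the same score of 16 stands at the 45th percentile in one group and at the 75th in the other, so the pupil appears below average or well above it according to which sample supplied the norms.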

Furthermore, the group used in standardizing an arithmetic test, 
for example, usually is different from that used in standardizing a 
geography test, so that even though both tests may have sets of 
norms for twelve-year-olds, there usually is an unknown degree of 
uncertainty in comparing a pupil's relative achievement in the two 
subjects; that is, one could not really say that a pupil who had the 
same standing with respect to the norms in these two tests really 
had the same degree of achievement in the two subjects. Only tests 
of the same quality and form with norms established at the same 
time and under the same conditions for the same groups of pupils 
and schools will enable such comparisons to be made between 
subject and subject and from year to year. Only then can teachers 
interpret test scores as indicators of differences within individual 
pupils from subject to subject and from year to year. Comparisons 
between pupils and between groups, moreover, are valuable only 
in so far as the nature of the “normative” group is known and 
appreciated. Since it is improbable that tests constructed independ-
ently by different classroom teachers will meet these requirements 
for comparability, standardized tests are clearly superior with respect 
to the types of interpretations they enable. Not all standardized 
tests, however, can meet these requirements; tests that are inde- 
pendently constructed and standardized on different groups under 
different conditions are not comparable. The Stanford Achievement 
Test is typical of those which have been so standardized as to meet 
requirements for comparability. 

The desirability of "intersubject" comparability and the short- 
comings of independently standardized tests in this respect have 
been well expressed by Flanagan (5:30): 


Norms which are defined in terms of the groups who happen to be 
taking the various subjects in a particular year are bound to fluctuate 
from year to year, no matter how adequate the sample. For instance, 
norms established only a few years ago were not proved satisfactory as 
a representation of our present secondary school population because of 
the great increase in the enrollment of certain types of students in the 
secondary schools. Also, in particular subjects, various factors such as the 
temporary dominance of some educational doctrine or the alteration of 
college entrance requirements may change very greatly the type of in-
dividual enrolled in a subject. From the point of view of individual guid-


[Italics ours.] It is very difficult to get a clear picture of an individual's 




objectives, however, frequently makes standardized tests inadequate 
for the evaluation of achievement in a particular classroom, 

d. Expenditures and Gains of the Teacher.—Under this heading 

we may first be concerned with the relative amounts of intellectual 


Since 
activity on the part of the teacher in all steps of the evaluation pro-
cedure than do standardized tests, they must be judged superior on 
this count. It is true, of course, that the use of standardized tests 
thus releases the teacher to devote more effort to instruction and 
other activities. But, as was noted above, the work of constructing 
an objective test yields real benefits to him in professional growth. 
Perhaps similar benefits would follow from the careful study and 
selection of standardized tests, their construction, administration, 
and interpretation, but the difference in any case will usually be in 
favor of teacher-made tests.

The financial cost of standard tests, which may become consider-
able when many pupils or frequent testings are involved, must 
enter into the choice of many teachers and school systems between 
the two types. Also the availability of standardized tests for only 
whole subjects or large units of subject matter makes them at best 
suitable for only occasional use. That is, they can seldom be adapted 
for evaluation of achievement other than at the beginning or end 
of the semester. This is reflected in the finding of Lee and Segel 
(11:4) that only 15 per cent of the 1242 responding teachers gave 
three or more standardized achievement tests during one semester, 
whereas teacher-made tests were used about eight times during one 
semester by the average teacher (11:2).

Summary.—The chief advantages of standardized tests are their 
possession of norms and their greater technical refinement. On 
the other hand, teacher-made tests fit the instructional objectives 
more closely, yield greater benefits to the teacher, and are more adaptable 
to continuous evaluation of achievement through a semester. The 
more extensive interpretations possible with some standardized 
tests can be achieved only through careful selection of the tests 
according to the meaningfulness of their norms; they are most use-
ful in the tool subjects, where the instructional objectives vary least 


from subject to subject and from year to year are to be studied 




QUESTIONS 


1. Set up a list of general objectives in some course in which you have 
recently had experience. In how many of these objectives was your 
achievement evaluated? In what ways? 

2. How have your attitudes toward and methods of preparation for 
essay and short-answer tests compared with those reported in this 
chapter? Which kind of test has usually had better motivational and 
instructional effects in your case? 

3. If you were evaluating understanding of this chapter would you use 
an essay or a short-answer test? Why? 

4. Under what conditions would it be desirable for a teacher to use 
the same tests year after year, improving them by whatever experi-
ence and statistical means are available, instead of constructing en-
tirely new ones each year?

5. Give reasons for and against informing pupils whether a forth- 
coming test is to be of the short-answer or the essay type. 

6. Why are there variations between state and community school sys- 
tems in instructional objectives for non-tool subjects such as American 
history? How do these variations affect the applicability of standard- 
ized tests? 


REFERENCES 


1. Boucher, C. S., The Chicago College Plan, Chicago: University of 
Chicago Press, 1935.

2. Brown, Clara M., Evaluation and Investigation in Home Economics, 
New York: F. S. Crofts & Co., 1941.

3. Curtis, F. D., and Woods, G. G., “A study of the relative teaching 
values of four common practices in correcting examination papers,” 
School Review, 37 : 615-623 (1929).

4. Douglass, H. R., and Tallmadge, M., “How university students 
prepare for new types of examinations,” School and Society, 39 : 318-
320 (1934).

5. Flanagan, J. C., “The Interpretation of the Cooperative Achieve-
ment Test Scores,” The Cooperative Achievement Tests: A Hand-
book Describing Their Purpose, Content, and Interpretation, 
New York: The Cooperative Test Service, October, 1936.

6. Hanford, A. C., “Tests and examinations at Harvard College,” 
Proceedings, 1936, Institute for Administrative Officers of Higher 
Institutions, Chicago: University of Chicago Press, 1936, pp. 5-26.

7. Hartog, P., and Rhodes, E. C., An Examination of Examinations, 
International Institute Examinations Inquiry, London: Macmillan & 
Co., Ltd., 1935.

8. James, A. W., “The effect of handwriting on grading,” English 
Journal, 16 : 180-205 (1927).

9. Jones, E. S., “The relationship of examinations and instructions,” 
Proceedings, 1936, Institute for Administrative Officers of Higher 
Institutions, Chicago: University of Chicago Press, 1936.

10. Kelley, T. L., and Krey, A. C., Tests and Measurements in the 
Social Sciences, New York: Charles Scribner's Sons, 1934.

11. Lee, J. M., and Segel, D., Testing Practices of High School Teach-
ers, U. S. Office of Education, Bulletin No. 9, 1936.

12. Lindquist, E. F., in Hawkes, H. E., and others, The Construction 
and Use of Achievement Examinations, Boston: Houghton Mifflin 
Company, 1936, pp. 125-159.

13. Meyer, G., “An experimental study of the old and new types of 
examination,” Journal of Educational Psychology, 25 : 641-661 
(1934), and 26 : 30-40 (1935).

14. Odell, C. W., Traditional Examinations and New Type Tests, 
New York: D. Appleton-Century Company, Inc., 1928.

15. Progressive Education Association, Evaluation Materials Developed 
for Various Aspects of Thinking, Chicago: Evaluation in the Eight 
Year Study, Progressive Education Association, 1939.

16. Ruch, G. M., The Objective or New-Type Examination, Chicago: 
Scott, Foresman & Company, 1929.

17. Sheppard, E. M., “The effect of the quality of penmanship on 
grades,” Journal of Educational Research, 19 : 102-105 (1929).

18. Starch, D., and Elliott, E. C., “Reliability of the grading of high 
school work in English,” School Review, 20 : 442-457 (1912).

19. Terry, P. W., “How students review for objective and essay tests,” 
Elementary School Journal, 33 : 592-603 (1933).

20. Tyler, R. W., Constructing Achievement Tests, Columbus: Ohio 
State University, 1934.

21. Vernon, P. E., The Measurement of Ability, London: University of 
London Press, 1940.


CHAPTER IX 


Constructing Short-Answer Tests 


OUR CONSIDERATION OF THE FIRST FOUR QUESTIONS HAS LED TO THE 
conclusion that in the vast majority of situations the teacher should 
use a short-answer test constructed by himself for the purpose of 
evaluating the pupil's achievement of instructional objectives. We 
are therefore confronted with the following question: 

5. If a teacher-made short-answer test is chosen, what types of 
questions should be used, and how should they be apportioned and 
composed? 

The terms in which the answer to this question will be expressed 


Table of Specifications 
The table of specifications draws its material from the statements 
of instructional objectives whose nature, form, and content were 
described in Chapter II. It differs from the statement of objectives 
in that it is designed specifically to aid in test construction, rather 
than to serve as a guide for instruction. Furthermore, the table of 
specifications for an achievement test will contain only that part of 
the statement of objectives which can be evaluated by its means and 
will omit those parts whose achievement cannot be evaluated by 
means of a device involving language in the form of a teacher-made 
short-answer test. The table of specifications should contain some 
indication of the relative emphasis or importance attached to each 
of the subdivisions of the instructional objectives, in terms either of 
the time allotted to each subdivision or of its significance in rela-
tion to other important objectives. The subdivisions of the statement 
of objectives should, of course, be made on more than one basis so 
that no important points of view will be omitted or unduly empha-
sized. For a history course, subdivisions would usually be made in 
terms of chronological periods, types of subject matter (social, politi-
cal, or economic), types of mental processes (interpreting data, 
applying generalizations, recognizing cause and result relationships), 
and any other bases considered important according to the statement 
of instructional objectives. For an algebra course, a table of specifica-
tions might be subdivided into vocabulary, types of computations, 
types of problems, familiarity with formulae, and so forth. That is, 
the construction of the test should be checked for representativeness 
with the proper emphasis on each of the subdivisions arrived at by 
each of these types of classification. 
This procedure may be further illustrated in terms of the first 
objective listed by Tyler for an elementary course in zoology: 
“a fund of information about animal activities and structures, 
specific facts and general principles which the student should be 
able to recall without reference to textbooks, or other sources of in-
formation.” The activities and structures here mentioned could be 
classified under such headings as “reproductive, habitation, food-
getting activities,” and “respiratory, digestive, circulatory, and 
structures.” Test content would then be compared with each 
of these subdivisions to insure that the proper emphasis has been 


ing each item as it is prepared in the appropriate subdivision 
in the table of specifications it will be possible to maintain 
the distribution of the items in accordance with the emphases de-
termined by the statement of objectives. 
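
The apportionment of items among subdivisions can be sketched in Python. This is a minimal illustration under stated assumptions: the three activity headings are those named above for the zoology objective, but the relative weights (imagined here as class periods given to each), the forty-item test length, and the function name are all hypothetical. Leftover items are assigned by largest remainder, which is only one reasonable rounding rule.

```python
# Apportion test items among the subdivisions of a table of specifications
# in proportion to the emphasis given to each.
# Weights and test length below are invented for illustration.

def allocate_items(weights, total_items):
    """Return a dict mapping each subdivision to its number of items."""
    total_weight = sum(weights.values())
    exact = {name: total_items * w / total_weight for name, w in weights.items()}
    counts = {name: int(share) for name, share in exact.items()}  # round down
    leftover = total_items - sum(counts.values())
    # Hand the remaining items to the subdivisions with the largest remainders.
    by_remainder = sorted(exact, key=lambda n: exact[n] - counts[n], reverse=True)
    for name in by_remainder[:leftover]:
        counts[name] += 1
    return counts


emphasis = {"reproductive": 5, "habitation": 3, "food-getting": 4}
plan = allocate_items(emphasis, 40)
print(plan)
```

For these assumed weights the plan calls for 17, 10, and 13 items respectively. As items are written they can be tallied against such a plan, so that any subdivision running ahead of or behind its share is noticed before the test is assembled.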


Composing Test Items 


The next step is to compose the individual items. Short-answer 
test items can take any one of a large number of forms, such as 
true-false, multiple-choice, completion, matching, and many others. 
Which one should be used for testing the particular item of content 
or objective is a matter of psychological insight into the mental 
operations required. But the fitness of any form of test item for 
testing a particular objective or mental operation is a function not 
only of the form of the item but also of the way in which it is 
applied. That is, true-false items may test merely memory but they 
can also be designed to test interpretation, organization, ability to 
infer, or any other mental process. Similarly, multiple-choice items 
will vary in what they test with the specific words or phrases which 
go into them. Consequently, the teacher must not only acquire 
familiarity with the different types of test items but also learn to 
make each form serve specific purposes in specific situations. Skill 
and ingenuity must then be exercised while he proceeds to compose 
individual items for each specific subdivision of the table of specifica- 
tions. No attempt should be made in advance of actually composing 
the items to determine what types should be employed. 

Before we proceed to a discussion of the form of short-answer test 
items, it will be profitable to consider some of the principles which 
apply to all of them. Some of these principles are merely admoni- 
tions to use common sense, but they are justified by the frequency 
with which teacher-made tests violate them. Others reflect research 
and experience accumulated by test experts and would not be 
apparent to newcomers in the field of testing. 

1. Avoid obvious, trivial, meaningless, and ambiguous items. 

2. Observe the rules of rhetoric, grammar, and punctuation.

3. Avoid items which have no answer upon which all experts will 
agree. 

4. Avoid “trick” or “catch” items, that is, items so phrased that 
the correct answer depends upon a single obscure key-word, to 
which even good students are unlikely to give sufficient attention. 

5. Avoid items which contain “irrelevant cues.” These are items
whose phrasing is such that the correct answer may be determined
merely by the exercise of intelligence without real possession of
the achievement at which the item is aimed. The kinds of
irrelevant cues which may occur vary with different types of test
items and will be discussed specifically in connection with each
type. Illustrative of an irrelevant cue which can occur in all types
is the following item: “Man is an (1) plant, (2) reptile,
(3) animal, (4) bird.” The irrelevant cue here is the article “an,”
which indicates that the correct answer must begin with a vowel.
(This is perhaps not the only flaw in this item.)

6. Avoid items which furnish the answers to other items, since one
of the items will be rendered useless for evaluation purposes by
this mistake.

7. Require all pupils to take the same test, and permit no choice
between items. Pupils cannot be compared with one another unless
they all take the same test.
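
The article cue in the “Man is an ...” item can be looked for mechanically. The following Python sketch is a rough check under stated assumptions: the function name `article_cue` is hypothetical, and the test looks only at the first letter of each option, so it will misjudge words such as “hour” or “unit” whose spelling and pronunciation disagree.

```python
VOWELS = set("aeiou")


def article_cue(stem, options):
    """Return the options that agree with a stem-final article "a" or "an".

    If the stem ends in neither article, all options are returned,
    since the article then gives nothing away.
    """
    words = stem.lower().split()
    last_word = words[-1] if words else ""
    if last_word == "an":
        # "an" is grammatical only before an option starting with a vowel.
        return [o for o in options if o[:1].lower() in VOWELS]
    if last_word == "a":
        # "a" is grammatical only before an option starting with a consonant.
        return [o for o in options if o and o[:1].lower() not in VOWELS]
    return list(options)
```

When the returned list is shorter than the list of options, the article has narrowed the choice for the pupil, and the stem should be reworded, for instance by moving the article into each option.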
It will be useful to note the frequency with which teachers of
various subjects use various types of test items. Table 4 shows the
results obtained by Lee and Segel in their study of the testing
practices of high school teachers.

Table 4. Percentages of Teachers in Each Department Who Use Various Kinds of
Questions in the Tests They Construct Themselves
(From Lee and Segel, 5 : 7)

While the percentages presented
in this table cannot indicate the ideal practices, they should furnish
the reader with an appreciation of the variety and frequency with
which the various types of items are used by teachers in the various
fields of subject matter. It should be noted that for the total group
the completion type ranks first in frequency of use, followed in
order by the true-false, one-word, multiple-choice, essay, problems,
and matching types. These rankings differ, of course, from one subject
to another. The rankings of the different types have furthermore
probably changed since the advent in 1935 of the International Test
Scoring Machine which is described in Chapter XX. The increasing
use of this machine for scoring teacher-made as well as commercially
available tests is moving the multiple-choice type into first ranking
and forcing out of existence the completion type, which in its usual
form cannot be scored by machine.


Types of Test Items


Two classes of short-answer test items may be distinguished on
the basis of the type of thinking required: recall and recognition.
Recall consists in reproducing material that has been learned; the 
pupil is given a question or incomplete sentence which can be 
answered or completed only by his furnishing the correct material. 
Recognition consists in distinguishing the correct alternative from 
among the “more than one” alternatives furnished by the question 
or item. As we stated above in another connection, these classes do 
not inherently differ from each other in value for testing different 
subject matters or mental processes; rather, their value depends upon 
the skill with which they are applied. Consequently, in selecting 
from among these types the teacher must determine the fitness of 
each type for each specific situation rather than depend upon 
generalizations for all situations. What is tested by a specific type of 
test item depends not only upon what it gives to the pupil and what 
sort of response it requires of him but also upon the way in which a 
particular material has been taught in class. Thus questions which 
may require interpretative thinking on the part of the pupils in one 
classroom may require mere rote memory on the part of the pupils
in a classroom in which the desired response has been presented
explicitly.


In constructing test items the teacher must therefore ask himself
such questions as “What type of test item (true-false, completion,
multiple-choice, etc.) is best suited to evaluate the French vocabulary
of my pupils in the light of the way in which vocabulary has been
taught to them?” “What type of test item is best suited to the evaluation
of my pupils’ ability to reason arithmetically?” “What type of
test item is best suited to evaluate my pupils’ understanding of the
relative importance of the various causes of the American Revolution?”
In answering these questions he will employ as much psychological
insight as possible into the teaching situation, the pupils’
learning, and the relationship of both to the instructional objectives.

Under recall may be listed the following types of short-answer 
test items: 


1. Simple recall
2. Completion

Under recognition may be listed the following types of short-answer
test items:

3. Constant alternatives
   a. Two constant alternatives, such as true-false, yes-no, right-wrong,
      synonym-antonym
   b. Three constant alternatives, such as true-false-doubtful,
      true-false-converse (for mathematics especially)
   c. Constant alternatives with corrections
   d. Constant alternatives based on a given topic or body of
      material
   e. Constant alternatives with inferences
   f. Constant alternatives with qualifications
   g. Constant alternatives with diagrams
   h. Check list
   i. Master list
4. Changing alternatives
   a. Two or more changing alternatives
   b. Worst answer
   c. Common principle or most inclusive
   d. Most dissimilar
   e. Result from among causes
   f. Cause from among results
5. Matching
   a. Compound matching
6. Analogies
7. Rearrangement
   a. Chronological order
   b. Logical order
   c. Ranking
   d. Pied outlines

Let us now take up each of these types of short-answer items in 
turn, with definitions, illustrations, discussion of advantages and 
limitations, and suggestions for using it in the best way. 

1. Simple-recall Items.—Simple-recall items are usually in the 
form of a direct question, a specific direction or stimulus to which 
the pupil responds with one word, a single phrase, or a single 
sentence. Illustrations: 


What are the two main gases found in air? 
Who was the first President of the United States? 
Explain in one sentence or phrase what connection each of the following
people had with Abraham Lincoln:
1. Stephen A. Douglas
2. Chester A. Arthur
3. Ulysses S. Grant 
4. Dred Scott 


The advantages of this type are: (a) It eliminates almost entirely 
the possibility of guessing. (b) It constitutes a “natural” form of 
questioning, requiring little adaptation by the pupil. (c) It is easily 
prepared. (d) As was seen in our discussion of the relative values 
of essay and short-answer tests, recall tests serve as motivators of 
better learning and the development of the ability to recall. 
(e) Mention should be made of the particular usefulness of this 
type of item in problem situations in mathematics and the physical 
sciences, where the result of a complex reasoning and computational 
process can be expressed in a few symbols. 

Among the shortcomings of the simple-recall type of test item are: 
(a) Its scoring is not as completely objective as is that of recognition 
types of test items. Almost inevitably certain pupils will make responses
that require expert judgment to distinguish them from the
correct response included in the scoring key. (b) It requires more 
writing and more time per item than do recognition types. This is 
not a serious limitation because the difference is not large. (c) In 
subjects other than mathematics and the physical sciences the simple- 
recall item tends to become too much a matter of identifying, 
naming, and associating facts; interpretative, inferential handling 
of complicated concepts may be slighted. As will be noted below, 
the simple-recall type is unfitted for testing understanding of 
definitions. 

Suggestions for the Improvement of Simple-recall Items. (a) Short,
definite, clean-cut answers should be required. Only in the degree
to which this suggestion is followed will objectivity of scoring be
possible. Example:


Faulty: What kind of a process is evaporation? ..........
Improved: To what does a liquid change when it evaporates? ..........


(b) If several correct answers (e.g., synonyms) are possible, each 
should be considered correct. Example: 


What method is used to preserve meats which are transported over
long distances? ..........


Here the correct answer may be any one of the following: “refrigera- 
tion,” “cooling,” “icing,” “canning,” “cold-packing,” “salting,” “pick- 
ling,” or “smoking.” Provision should be made in scoring this item 
for counting each of these responses as correct or the item should be 
changed so as to restrict the correct answer to only one of these. 
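
A scoring key that credits every acceptable response, as this suggestion requires, might be sketched as follows. The structure and the names `KEY`, `normalize`, and `score_item` are hypothetical; the accepted answers are simply those listed for the meat-preservation item, and capitalization and stray spacing are disregarded.

```python
# Hypothetical scoring key: each item number maps to the set of responses
# counted as correct. The answers for item 1 are those listed in the text.
KEY = {
    1: {"refrigeration", "cooling", "icing", "canning",
        "cold-packing", "salting", "pickling", "smoking"},
}


def normalize(response):
    """Lower-case a response and collapse extra spaces."""
    return " ".join(response.lower().split())


def score_item(item_number, response):
    """Return 1 if the normalized response is an accepted answer, else 0."""
    return int(normalize(response) in KEY[item_number])
```

Restricting the item to a single correct answer, the alternative the text mentions, would correspond to a key whose set holds one entry.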
(c) Spelling should be either disregarded or given special attention
in the form of a separate score for spelling correctness. Accuracy in
spelling should usually be distinguished from knowledge of the
correct response as a separate instructional objective and therefore
should be separately evaluated.

(d) Minimize the use of textbook expressions or stereotyped
language in phrasing the questions. Such phrasing rewards and
motivates rote memorization of textbook materials, which is not
necessarily associated with a real understanding independent of terminology.


(e) Specify the terms in which the response is to be given. Failure 
to observe this rule will result in responses all of which may be 
correct while still evading the issue at which the item is aimed. 
Example: 


Faulty: Where is the world’s tallest building located? ..........
Improved: In what city is the world’s tallest building located? ..........


In the first case the correct answer could be anything from "North 
America" to "the Atlantic seaboard.” The second example forces 
the pupil to face the issue squarely. 

In this connection we include a reminder that with computational 
problems it should be indicated to the pupil whether or not "units" 
are to be given with his answer. Example: 


What is the density of a piece of metal having a volume of 100 cc. and 
weighing 180 grams? 


Answer 


answer 


The second and third ways of providing for the pupil's answer are 
obviously preferable to the first. In the case of the third line explicit 
provision should be made for scoring or disregarding the “unit” 
part of the answer. 

(f) In testing for a knowledge and understanding of definitions
it is often better to provide the name and require a definition than 
to provide a definition and require the name, because a higher level 
of achievement is thereby required. Since definitions cannot usually 
be phrased in the form of a “short answer,” it is obvious that the 
simple-recall type of test item is largely unfitted for testing under- 
standing of definitions. Examples: 


Faulty: What is the general term for vertebrates which suckle their 
young? 
Improved: Define “mammal.” .... 


Other types of test items, such as the multiple-choice, are probably 
better adapted to testing this kind of achievement. 


(g) Direct questions are probably preferable to incomplete declarative
sentences, especially for younger, less “test-sophisticated”
pupils, since the former are more similar to the forms in which
ordinary discourse is carried on. Example:


Faulty: America was discovered in the year ............
Improved: In what year was America discovered? ............


(h) Hints concerning the correct answer, in the form of the first 
letter of a word, or a number indicating the number of letters in a 
word, should generally not be employed. Such hints may tend to 
confuse pupils when the answer upon which they have decided,
although it is a correct synonym, does not coincide with the given 
hint. Guessing and responses to superficial cues may also result 
from this practice. 

(i) The position of the space for the response should usually be 
at the right of the question, for it has been found that most pupils 
prefer this position. This preference is probably due to the fact that
the right-handed pupil will not have to cover the question while
writing his answer.

(j) The amount of space allowed for the response should be sufficient
to provide for legible writing. Account should be taken of
the usually larger handwriting of younger pupils in arranging this
space.

(k) Arranging response spaces in a column at the right-hand
margin of the page makes the scoring process more convenient.

2. Completion Items.—Completion items require the pupil to 
"Write in" a word which has been omitted from either an isolated 
statement or connected discourse. Examples: 


A joule is a unit of ..........
Most automobile engines are ....... cycle engines.

The mother called to her child, “Please go to the store and
....... me some apples.”

The greatest nation of the western hemisphere is .......,
the head of which is the ......., who exercises the chief
power of the government but whose power is .......,
......., and ......., all of which carry .......


Since the advantages and limitations of the completion type of 
item are so similar to those of the simple-recall type, they will not 
be repeated. The completion type need not be restricted to the testing
of rote memory, mere fact-finding, verbal associations, the
knowledge of unique phrasing, or general intelligence. If applied 
with real ingenuity it can be designed so as to test the understanding 
of a complete thought and to encourage the integration of ideas. 

3. Constant-alternative Items.—a. Constant-alternative items are 
usually in the form of statements concerning whose truth or falsity 
the pupil is required to make a judgment. Examples: 


Directions: For each of the following statements encircle True if the 
statement is true and False if the statement is false. 


The Crusades were successful in converting many Mohammedans
to the Christian faith. ................ True False
The head of a newborn baby is about one-fourth of the
total body length. ................ True False


b. The number of alternatives furnished may be increased to three 
by permitting such answers as doubtful, can't say, or can't tell, in 
addition to true or false. Another three-alternative type is that in 
which the pupil is required to say whether two variables are posi- 
tively related, negatively related, or unrelated. Example: 


Directions: Below is given a list of variables influencing test reliability. 
After each variable write: 


+, if reliability is positively related to it.
-, if reliability is negatively related to it.
0, if reliability is not related to it.

1. Number of items. ..........
2. Range of item difficulty. ..........
3. Interdependence of items. ..........
4. Time allowed with fixed number of items. ..........
5. Reliability of criterion. ..........


The two-alternative form may be further varied so as to require the 
pupil to indicate not only whether the item is true or false but also 
whether the converse is true or false. Examples: 




Directions: Below are statements for which you are to encircle the 
two correct answers from among the following: 


T: The statement is true. 

F: The statement is false. 
CT: The converse is true. 
CF: The converse is false. 


A cube is always a rectangular parallelepiped. .......... T F CT CF
Two dihedral angles are equal if their plane angles
are equal. .......... T F CT CF
All vertebrates are mammals. .......... T F CT CF


c. Another variation of the constant-choice item is that which 
requires the pupil to make the proper changes in a single word 
so as to make false statements correct. Example:


Directions: For each of the following statements if the statement is 
true, underline the word True. If the statement is false, underline the 
word False, and write the substitute for the underlined word necessary 
to make it true. 


Nitrogen supports combustion. .......... True False ..........
Water is a chemical element. .......... True False ..........
NaCl is the symbol for common salt. .......... True False ..........


d. True-false statements may be based upon a given topic, para- 
graph, or body of material. Example: 


The Constitution of the United States sets forth:
1. The exact number of justices necessary for the Supreme
   Court. .......... Yes No
2. The exact number of Cabinet officers for the President’s
   Cabinet. .......... Yes No
3. The age at which one may exercise the right of suffrage. Yes No
4. The President as commander-in-chief of the Army and
   Navy. .......... Yes No
5. An exact minimum age for Senators and Representatives. Yes No


e. The pupil may be required to indicate whether a statement is 
true or false and then to judge the truth or falsity of inferences 
drawn from the statement which he marked true. Examples:

1. The President of the United States is commander-in-chief
   of the Army and Navy. .......... True False

   a. The President of the United States can declare war
      whenever he sees fit. .......... True False

   b. The President would be the commander-in-chief
      of a separate air force. .......... True False


f. True-false items with qualifications require the pupil to indicate 
whether the item is true or false or whether it can be made true 
only by adding certain qualifications to be given or selected by him. 
Examples:


Statements

1. The President has the power to appoint justices of the
   Supreme Court. .......... T F
2. The President can propose taxation measures. .......... T F
3. The President may appoint ambassadors to other countries. .......... T F
4. The President may appoint Senators-at-large. .......... T F


Qualifications

1. With the consent of the Senate
2. With the consent of the House of Representatives
3. No qualification


g. True-false items with diagrams present to the pupil a series of 
diagrams concerning which he is to make one of a small number 
of possible judgments. Diagrams of the Wheatstone bridge for 
measuring electrical resistance have been used in this manner in 
physics courses, the pupil marking each diagram with a plus sign 
if it was balanced and with a minus sign if it was unbalanced or if 
the wiring diagram was incorrect. Diagrams of tools, geometrical 
figures, housing plans, and similar materials can all be treated in 
this way. 

Advantages and Limitations. The merits of the constant-choice
type of item in its most used form, the true-false test, can easily be
outweighed by its limitations unless far more care than usual is
exercised in constructing it. (1) One type of achievement to which
it is well adapted is the ability to distinguish popular misconceptions
and superstitions from scientifically validated truths. For this kind
of material the true-false test provides a mode of presentation that is
similar to an actual life situation, in which the pupil must make
some judgment concerning the truth or falsity of statements occurring
in the layman's literature on any subject. (2) Another type of
material for which the true-false item is peculiarly well suited is
that in which the material does not lend itself to the construction
of more than two or three plausible alternatives. An example of this
is the following item: “Emergency exits from office buildings
should open outwardly.” Here there are only two possible choices,
inwardly or outwardly, so that the construction of a multiple-choice
item would lead to absurdities and phrasing this question as a
completion item, “Emergency exits from office buildings should
open _______,” would leave too indefinite the terms in which
the answer was expected. (3) Perhaps the major advantage of true-false
items is that they enable the teacher to evaluate a relatively
large sample of subject matter in a short period of time, since
pupils can generally answer more of this type of item per unit of
time than of other types.

It is this consideration which has led to its major disadvantages
as used by many teachers. (1) Its use as a time-saving device has
often been accompanied by hurried, careless, mechanical construction
so that this type of item has perhaps been more abused than
any other type. Teachers have lifted sentences out of textbooks to
provide true statements and inserted “nots” to provide the required
number of false statements. Consequently, the number of precautions
based upon technical analyses of true-false tests constructed
by large numbers of average teachers is probably larger for true-false
items than for other types. These will be considered in our
discussion of how to construct true-false tests; here our point is
that true-false items require greater care and thought upon the
teacher's part than is commonly given to them. (2) Apart from this,
the true-false test of itself tends toward greater ambiguity than do
other types. (3) Finally, the true-false item can be guessed more
easily by pupils than can other types, and consequently requires


should be approximately equal to the number of false statements.
(2) ...


Faulty: The President of the United States, a naturalized 
citizen, holds office for four years in a single term. 




a test of what it appears to test, namely, understanding of a
reason or cause. Example:

Faulty: There was much dissension among the American
colonists ... .......... T F
Improved: There was much dissension among the colonists,
because (either correct or incorrect reason). .......... T F


In the “faulty” illustration, the statement appears to test knowledge
of a reason but may be answered merely by knowing a fact; in the
improved illustration the item really tests what it appears to test,
namely, knowledge of a reason or cause.


(3) Avoid “specific determiners,” that is, words or modes of expression
that are usually associated with a true statement or a false
statement. Analyses of large numbers of teacher-made true-false
tests have shown that most statements containing
“only,” “alone,” “all,” “no,” “none,” “nothing,” “always,” ...
or “reason,” are false. On the other hand, statements containing
“should,” “may,” “most,” “some,” “often,” or “generally”
are usually true. ... to convert true statements into false ones
by using “not,” ... “all,” etc., then these words will cease to be
specific determiners. Specific determiners are undesirable, of course,
because they provide clues which enable pupils to respond correctly
without any real knowledge or understanding of the response.
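
A teacher can screen a draft true-false test for such words before using it. The Python sketch below is a crude word-matching check, not a measure of item quality: the function name `specific_determiners` is hypothetical, and the two word lists hold only the determiners that are legible in this passage, so they are partial.

```python
import re

# Words the passage names as tending to mark false and true statements.
# Both lists are partial; they hold only the words legible in the text.
FALSE_CUES = {"only", "alone", "all", "no", "nothing", "always"}
TRUE_CUES = {"should", "may", "most", "some", "often", "generally"}


def specific_determiners(statement):
    """Return the cue words found in a statement, grouped by tendency."""
    words = set(re.findall(r"[a-z]+", statement.lower()))
    return {
        "suggests_false": sorted(words & FALSE_CUES),
        "suggests_true": sorted(words & TRUE_CUES),
    }
```

A flagged statement is not necessarily faulty. The check is meant to show whether such words appear about equally often in the true and the false statements of a test, in which case they give the pupil no clue.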
(4) Ambiguity can be reduced by the use of quantitative rather
than qualitative language wherever possible. Direct comparisons,
for the same reason, should be used wherever a quantitative ...


Improved: Willkie received fewer than fifteen million votes in
the election of 1940. .......... T F


(5) Use simple, non-technical language. Examples:

Faulty: Only after having analyzed his problem and set up an
hypothesis, should the experimenter, who should be familiar
with the literature in his field, decide upon an experimental
technique. .......... T F
Improved: An experimenter should know what others have done
in his field before deciding upon a technique to test his
hypothesis. .......... T F


Faulty: There was a marked efflorescence of inventive activity
in the nineteenth century. .......... T F
Improved: There was a marked increase in inventive activity
during the nineteenth century. .......... T F


(6) Instructions against guessing on true-false or other two- 
alternative tests have been used by some test technicians and avoided 
by others. Such directions may be considered desirable on the ground 
that they decrease the element of chance success which enters into 
responses on recognition tests. Consequently it is reasoned that 
such directions should result in a more valid measurement of actual 
achievement. On the other hand it is argued that they can never 
achieve their purpose because pupils do not know when they have 
complete knowledge of the correct response, no knowledge, or only 
partial knowledge. Many guesses are not "pure" in the sense of 
being based on no knowledge whatsoever. Rather, students may 
frequently respond correctly on the basis of partial knowledge or 
"hunches." 

Furthermore such directions usually introduce a "personality," 
or non-intellectual, factor into the measurement of achievement. 
This is the pupil's tendency to gamble, or his willingness to respond 
on the basis of partial knowledge rather than only when completely 
certain of the correct response. Pupils have been found (3, 15) to 
differ in this personality trait; instructions against guessing will 
make the test scores depend on both intellectual and “personality” 
factors in unknown proportions. 

A similar argument opposed to such instructions is the need for 
requiring all the pupils taking a test to “run the same race.” Only 
when all are required to respond to all the items can constancy of 
the measuring instrument be obtained. Furthermore, instructions to 
answer all items elicit a greater part of the pupil’s real achievement 
for it has been found (14) that guessed answers are more often 
correct than incorrect. If pupils were instructed to withhold re- 
sponses at which they considered themselves to be guessing, an 
appreciable part of their achievement would be unrevealed and an
unfair advantage would be given those who were willing to attempt
a response on the basis of relatively uncertain knowledge. 

It has been argued that instructions to answer all items may lead 
pupils to lose respect for tests and to regard them as guessing games. 
Whatever effect of this kind such instructions may have can prob- 
ably be counteracted quite readily if teachers will explain the purpose 
of these instructions in simple terms. 

In summary, the writers urge that no instructions against guessing 
be used with two-alternative recognition tests, or with recognition 
tests presenting more than two alternatives. Rather, pupils should be 
instructed to answer every item without omissions. 

(7) A further issue in the use of recognition tests, especially those 
presenting only two alternatives per item, is the desirability of a 
scoring formula intended to correct for chance successes, or guessing. 
The general and two-alternative forms of the scoring formula used 
for this purpose are as follows: 


          General Form                          Two-alternative Form
(1)  $S = T - \frac{n}{n-1}\,W - O$          $S = T - 2W - O$
(2)  $S = R - \frac{W}{n-1}$                 $S = R - W$
(3)  $S = R + \frac{O}{n}$                   $S = R + \frac{O}{2}$


where S = score
      T = number of items in the test
      R = number of right responses
      W = number of wrong responses
      O = number of omitted responses
      n = number of response alternatives presented in each item


These three formulae, while differing in the kinds of counting required,
yield scores that correlate perfectly with one another. The
third formula, which requires that R's and O's be counted, is more
acceptable to pupils because it gives full credit for right responses
and an additional fractional credit for items omitted rather than
a deduction for wrong responses.


All these formulae are based on the assumption that guessed 
responses in two-alternative tests will be right as often as wrong. 
This assumption is justifiable only on the average rather than for any 
individual pupil. In so far as it is not justified, the formulae over- 
or under-penalize wrong responses. 
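
The relation among the three formulae can be checked numerically. The Python sketch below rests on an assumption: the general forms are taken to be the customary ones, S = T - nW/(n-1) - O, S = R - W/(n-1), and S = R + O/n, which reduce to the two-alternative forms S = T - 2W - O, S = R - W, and S = R + O/2. The function names are hypothetical.

```python
from fractions import Fraction


def score_from_wrongs_and_omits(T, W, O, n=2):
    """Formula (1), assumed general form: S = T - n/(n-1) * W - O."""
    return T - Fraction(n, n - 1) * W - O


def score_rights_minus_wrongs(R, W, n=2):
    """Formula (2), assumed general form: S = R - W/(n-1)."""
    return R - Fraction(W, n - 1)


def score_rights_plus_omits(R, O, n=2):
    """Formula (3), assumed general form: S = R + O/n."""
    return R + Fraction(O, n)
```

For a pupil with 30 right, 12 wrong, and 8 omitted on a 50-item true-false test, formulae (1) and (2) both give 18, while formula (3) gives 34. Under these forms the third score always equals T/n plus (n-1)/n times the second, which is why the scores correlate perfectly although they differ in size.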

When the test is of such length and the directions are so phrased 
that every pupil responds to every item, the formulae need not be 
used since in such a case there is a perfect correlation between simple 
(number right) scores and corrected (e.g, R — W) scores. If the 
test maker follows the recommendation in the preceding discussion, 
that such instructions and test lengths be used, the formulae for cor- 
recting for chance success will seldom be required. 
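
The statement that simple and corrected scores correlate perfectly when every item is answered can be verified in a few lines. The derivation below assumes a test of T items with n alternatives each and no omissions, and it takes formula (2) in its customary general form, S = R - W/(n-1).

```latex
\begin{align*}
O = 0 \quad &\Longrightarrow \quad W = T - R \\
S &= R - \frac{W}{n-1} = R - \frac{T - R}{n-1} = \frac{nR - T}{n-1} \\
\text{for } n = 2: \quad S &= R - W = 2R - T
\end{align*}
```

Since T and n are the same for every pupil, the corrected score is an increasing linear function of the number right, so the two sets of scores rank the pupils identically.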

h. Before proceeding to the next class of test items we should deal 
with the constant-alternative types known as the “check list” and 
the “master list." The check list requires the student to go through 
a list of phrases or statements and mark each one which fulfills 
certain requirements set up in the directions. Example: 


Directions: Place a check in front of each of the following which is 
part of an airplane: 
..... aileron          ..... windbeacon
..... rudder           ..... strut
..... gear shift       ..... propeller
..... mast             ..... periscope
..... stabilizer       ..... stick
..... accelerator      ..... helium bag
..... throttle         ..... ripcord


It is evident that the check list is almost indistinguishable from the 
true-false item with respect to scoring, advantages, and disadvantages. 

i. The master list requires the pupil to select from a constant 
group of alternatives, usually more than two, the one which best 
applies to a group of items. Example: 


Directions: From the list at the top select the kind of vote required 
by the phrases below and place its number on the line at the right. 
1. Majority 
2. Two-thirds majority 
3. Three-fourths majority 
4. Unanimous 


Number of jury votes necessary to convict a criminal. ..........

Number of votes necessary to expel a member of either House
of the United States Congress. ..........

Number of members necessary for a quorum of each House. ..........

Number of votes needed to pass a bill over the President's
veto. ..........

Number of votes necessary to initiate amendments to the
United States Constitution. ..........

Number of state legislatures that must ratify an amendment
before it becomes valid. ..........

As will be seen later, the master list differs from matching in
that the numbers of items contained in the two groups of the
master list are much more disproportionate, so that items from one
of the lists must be used more than once. The master list differs
from the multiple-choice form in that the alternatives remain the
same instead of changing for each question.

4. Changing-alternative Items.—a. Changing-alternative items require
the pupil to select the best one of a group of several alternatives
which change with each item. The most usual form of this kind is
the multiple-choice item. Example:

A fuse is placed in an electric current to (1) measure the
current, (2) affect the current, (3) prevent excessive flow of
current, (4) lower the resistance. ..........

Although the variety of bases upon which the pupil can be instructed
to select the correct answer is almost infinite, only a few
of them are used with any frequency. Among them are: the best
answer, the correct answer, the worst answer, the least satisfactory
answer, the most inclusive term, the most dissimilar term, the
result from among causes, the cause from among results, etc.


Examples:

b. Worst Answer:

Directions: Of the four alternatives presented as completions to each
statement, choose the worst and write its number in the space at the
right.

An isosceles triangle is one in which (1) two sides are equal,
(2) two angles are equal, (3) one of the altitudes is perpen-
dicular to its side, (4) the base equals the altitude. ........ 4




c. Most Inclusive Answer: 


Directions: Following are sets of four words or terms, one of which 
includes the other three. You are to select the inclusive term from each 
set and write its number in the space to the right of the set. 


(1) smallpox, (2) tuberculosis, (3) scurvy, (4) disease. ...... 
(1) tapeworm, (2) typhoid bacillus, (3) parasite, (4) dodder. ..


d. Most Dissimilar Answer: 

Directions: Following are sets of four words or terms, one of which
does not belong with the other three. Select the word or term which
does not belong and write its number in the space to the right of the set.


(1) Abraham Lincoln, (2) Ulysses S. Grant, (3) Stephen A.
Douglas, (4) Chester A. Arthur. ........
(1) shark, (2) ape, (3) giraffe, (4) kangaroo. ........


e. Result from Among Causes: 

Directions: In each group of four events given below there is one 
result together with three causes which contributed to bringing about 
this result. Select the result and write its number on the line at the right 
of the item. 


(1) Sinking of the Lusitania, (2) unrestricted submarine war- 
fare, (3) declaration of war in 1917, (4) invasion of Belgium. ...... 3


f. Cause from Among Results:

Directions: From the following groups of four historical events or
conditions, select the one which may best be considered to be a cause
of the other three.


(1) English religious intolerance, (2) settlement of Jamestown,
(3) American freedom of religion, (4) sailing of the May-
flower. ........


The number of alternatives presented can vary from two to any 
larger number, but the most frequently used numbers are four and 
five. 

Advantages and Limitations.—(1) The changing-alternative type 
of test item or the multiple-choice item is better adapted to testing 
the higher mental processes, such as inferential reasoning and fine
discrimination. Not only these higher levels of achievement can be
approached by it, but also all the lower levels down to the rote
memorization of isolated facts. It is the most flexible type of test 
item available for varying types of mental process according to 
specific kind of subject matter. (2) The multiple-choice item is 
frequently preferable to the simple recall when the correct response 
is lengthy or involved or can be written in several correct forms. 
(3) In proportion as it has numbers of alternatives greater than 
two, the possibility of guessing the correct answer is less than in the 
true-false test; the greater the number and plausibility of these 
alternatives the less chance there is for a guessed answer to be 
correct. 
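
The relation stated in point (3) can be made concrete with a small calculation. The sketch below assumes, as an idealization that the text does not claim, that an uninformed pupil finds every alternative equally attractive, so that the chance of a lucky guess on one item is 1/k for k alternatives. The function names are illustrative only.

```python
from fractions import Fraction


def chance_of_guessing(num_alternatives: int) -> Fraction:
    """Probability of marking the keyed answer by blind guessing.

    Assumes every alternative is equally attractive to the guesser.
    """
    if num_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return Fraction(1, num_alternatives)


def expected_chance_score(num_items: int, num_alternatives: int) -> Fraction:
    """Average number of items a pure guesser would get right."""
    return num_items * chance_of_guessing(num_alternatives)


# Compare a true-false test (2 alternatives) with 4- and 5-choice tests
# of the same hypothetical length, 40 items.
for k in (2, 4, 5):
    print(k, chance_of_guessing(k), expected_chance_score(40, k))
```

In practice distractors are seldom equally plausible, so the real chance of a successful guess is usually somewhat higher than 1/k; this is why the text stresses plausibility as well as number.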

Disadvantages of the multiple-choice test item are: (1) It is much 
more difficult to construct well than are other forms of test items. 
The necessity for devising plausible incorrect alternatives places a 
heavy burden on the ingenuity and psychological insight of the 
teacher. It might well be said that constructing a single multiple- 
choice item with four alternatives requires an amount of labor 
equivalent to constructing four true-false or simple-recall items. 
This objection, however, should carry weight only with those who
construct evaluating devices in a hurried, careless manner; it should
not be considered by those who give evaluation the emphasis it
deserves. (2) Multiple-choice items require more space on the
written page per item than do other types. Consequently, their use
may be disadvantageous in situations where the teacher must
conserve paper or has difficulty in securing facilities for reproducing
the test by mimeograph or some other device. (3) Multiple-choice
items require more time per item than do some other types. This 
disadvantage operates in the practical classroom situation so as to 
bring the multiple-choice item closer to the true-false item in net 
total merit. That is, the true-false item, despite its relative unrelia- 
bility due to the guessing factor, is simpler to construct and more 
efficient in terms of space and available time than is the more 
reliable multiple-choice item. The greater reliability of the latter due 
to its larger number of alternatives may thus be counteracted by 
the larger number of true-false items which it is possible to present 
to pupils during a given period of time. The fact remains, however,
that the multiple-choice item lends itself to the evaluation of higher
levels of mental process, finer discrimination, and more varied
types of instructional objectives than does the true-false item. In
final examination situations where time is less of a consideration 
and the number of multiple-choice items presented may be in- 
creased so as to sample the subject matter adequately, the multiple- 
choice item acquires a distinct advantage over most other types. 
Thus, as is reflected by its preponderance in standardized, highly 
refined tests, the multiple-choice item is perhaps the best available 
all-round short-answer testing technique. 

Rules and Suggestions for Constructing Multiple-choice Items— 
Multiple-choice items consist of an introductory part and of several 
alternatives which may be divided into the true answer and the false 
alternatives or distractors. (1) The introductory part may be in the 
form either of a direct question or of an incomplete statement. 
Many workers consider the direct-question form preferable on the 
grounds that it is more easily phrased, more natural for the pupil,
less likely to be ambiguous, conducive to greater homogeneity of
pupil responses, and less likely to contain irrelevant clues to the
correct answer. On the other hand this form is likely to be slightly
longer and to require more words than the equivalent incomplete
statement form.

(2) In any case, if the introductory incomplete statement form is 
used it should be meaningful in itself and imply a direct question 
rather than merely lead into a collection of unrelated true-false 
statements. Example:


Faulty: The United States of America (1) has more than
150,000,000 people, (2) grows large amounts of rubber, (3) has
few good harbors, (4) produces most of the world’s auto-
mobiles. ........ 4
Improved: The population of the United States is characterized
by (1) an increasing birth rate, (2) varied nationality back-
grounds, (3) its even distribution over the area of the United
States, (4) an increasing proportion of young people. ........


(3) The distractors should be plausible, so that uninformed pupils 
will tend to select them rather than the correct answer. Plausibility 
may be attained by making the distractors as familiar as the correct 
answer, related to the same concept as the correct answer, and as 
reasonable and natural as the correct answer. One method of securing
plausible distractors is to use the introductory statement as a
completion test and then tabulate the incorrect responses of pupils. 
Stalnaker and Stalnaker (12) found that distractors selected in this 
way for a five-choice best-answer vocabulary test were marked more 
often by uninformed pupils than were distractors selected at random; 
since there was no interference by these selected distractors with the 
discriminatory power of the item, their use was recommended. On 
the other hand Kelley (4) concluded that the test maker's judgment 
was about as valid as tabulating the responses from recall tests as a 
method of securing the incorrect responses for vocabulary tests. 
Whatever method is chosen the goal must be the same: plausible dis- 
tractors that will attract uninformed pupils away from the correct 
answer.

(4) The length of the alternatives—that is, the number of words 
each contains—should not vary systematically with the correctness 
of the distractor. Otherwise pupils may come to learn that the long 
distractors are usually the correct ones, or vice versa. 

(5) The arrangement of alternatives should be as uniform as pos- 
sible throughout the test. If possible, the alternatives should come 
at the end of the statement if the incomplete-statement form of 
introduction is used. If space permits, they should be listed one under 
the other rather than placed in a paragraph, because listed alterna- 
tives are easier to read and consider separately. Example: 
Faulty: The cheapness of land and scarcity of labor in the West
created (1) an aristocratic class of landowners, (2) a large
class of wage-earning men, (3) a system of servitude, (4) a
large class of small freeholders. ........
Improved: The cheapness of land and scarcity of labor in the
West created
(1) an aristocratic class of landowners. ........
(2) a large class of small freeholders. ........
(3) a system of servitude. ........




Faulty: The best way to employ leisure time is (1) read good books 
and magazines, (2) movies are usually enjoyable and educational, 
(3) minding your own business, (4) in playing cards and working out 
puzzles. 


Improved: The best way to employ leisure time is to (1) use it reading
good books and magazines, (2) go to the movies, (3) mind your own
business, (4) use it playing cards and working out puzzles.

(7) The number of alternatives should be at least four or five if
possible. However, it should be reduced below this if it is impossible
to construct others without involving absurdities or obviously false
distractors. It is unnecessary to maintain a fixed number of alterna-
tives for every item unless a formula to correct for chance guessing
is applied to the score. Also, when the total score on a test is com-
puted as a percentage of items correct, the number of alternatives
should be the same in all items.

(8) Corrections for guessing need not be applied in the usual 
classroom testing situation if the items contain more than two alterna- 
tives. In standardized multiple-choice tests where statistical evidence
for the equal plausibility of the distractors has been obtained, the 
application of the correction for guessing formula does yield worth- 
while results. The formulae for this purpose are given on page 163. 
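
For illustration only, the sketch below applies the conventional correction for guessing, in which the number of wrong answers divided by one less than the number of alternatives is subtracted from the number right. This conventional form is an assumption here: the book's own formulae on page 163 are not reproduced in this section, and the sketch should be checked against them. Omitted items are taken to count neither as right nor as wrong, and the function name is hypothetical.

```python
def corrected_score(right: int, wrong: int, num_alternatives: int) -> float:
    """Conventional correction for guessing: R - W / (k - 1).

    right            -- number of items answered correctly (R)
    wrong            -- number of items answered incorrectly (W);
                        omitted items are not counted here
    num_alternatives -- alternatives per item (k), the same for every item
    """
    if num_alternatives < 2:
        raise ValueError("an item needs at least two alternatives")
    return right - wrong / (num_alternatives - 1)


# A hypothetical pupil with 30 right and 8 wrong on a five-choice test.
print(corrected_score(30, 8, 5))
# The same record on a true-false test is penalized more heavily.
print(corrected_score(30, 8, 2))
```

The correction presupposes the same number of alternatives in every item, which is the reason rule (7) asks for a fixed number whenever such a formula is applied.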

(9) When the simple-recall type is more suitable, the multiple- 
choice form should not be used. Such situations may occur (a) when
the answer required by the simple-recall item requires little more
writing than does the multiple-choice item, (b) in computational
questions where the response is in numerical form, (c) when only
two alternatives can be devised that are at all plausible.

(10) The level of mental process required by an item is dependent
in large part upon the homogeneity of the alternatives presented by
it. Examples:


Less homogeneous: Which city is nearest to Chicago? (1) Los
Angeles, (2) New York, (3) St. Louis, (4) Miami. ........
More homogeneous: Which city is nearest to Chicago? (1) Min-
neapolis, (2) St. Louis, (3) Cleveland, (4) Milwaukee. ....
Less homogeneous: Archimedes’ Law deals with (1) falling
bodies, (2) liquid-displacing bodies, (3) heated bodies,
(4) light-giving bodies. ........
More homogeneous: Archimedes’ Law states that a floating body
(1) will seek its own level, (2) will displace a volume of liquid
whose weight equals the body's weight, (3) will receive pres-
sure from the liquid which is equal in all directions, (4) cannot
float midway between the surface of the liquid and the bottom
of the vessel. ........ 2


(11) Understanding of definitions is better tested by furnishing 
the name or word and requiring choice between alternative defini- 
tions than by presenting the definition and requiring choice between 
alternative names or words. Example: 


Faulty: Water enters the air by a process called (1) osmosis,
(2) filtration, (3) condensation, (4) evaporation. ........ 4
Improved: Evaporation is a process by which (1) vapors turn
into liquids, (2) liquids pass between two porous surfaces,
(3) solids dissolve in liquids, (4) liquids turn into vapors. ........


(12) Distractors can often be made attractive and plausible by
expressing them in textbook phraseology, the correct response
being expressed in original terms.

5. Matching.—Matching items are related to both the master list
and the multiple-choice forms, in consisting of two sets of items
which are to be associated on some basis furnished by the directions.
As has been stated, it is only the greater similarity between the
number of items presented in the two lists to be matched that dis-
tinguishes matching sets from the master list. Similarly, matching
sets are distinguished from multiple-choice items in that the same
group of alternatives is used in selecting the answers for matching
tests, while in multiple-choice tests each question has its own group
of alternatives. From these relationships of the matching test to
other forms it may be correctly inferred that the variety of bases
upon which the two lists of items can be matched is almost infinite.
Examples of these bases are: events and dates, events and places,
events and results, inventions and inventors, books and authors,
processes and products, usages and rules, names and definitions,
causes and effects. Similarly, the lists may be presented in many
varying forms; diagrams, maps, pictures, chronologically or logi-
cally arranged items with numbered gaps between them may all be
used if they are appropriately labeled and proper directions are
given. Example:




Directions: After each phrase write the number of the abbreviation
which stands for the appropriate organization.

Abbreviation        Organization
1. A.A.A.           Student aid. ........
2. N.Y.A.           Soil conservation. ........
3. C.C.C.           Industrial union. ........
4. S.E.C.           Norris Dam. ........
5. C.I.O.           Work on forest and park. ........
6. T.V.A.           Trade or craft union. ........
7. A.F.L.


Directions: After each animal write the number of its gestation and
incubation period.

Gestation and Incubation Period
1. 280-283 days
2. 330-340 days
3. 143-150 days
4. 112-114 days
5. 67-70 days
6. 20-21 days

Animals


a. A variation of the matching form is compound matching, which
requires matching three or more lists with one another on indicated
bases. Thus events can be matched with both places and dates, book
titles with both authors and themes, processes with both raw material
and products, etc. Example:

Directions: After each city write first the number of its state and then
the letter of its major industry.

States               Major Industries           Cities
1. Illinois          a. Autos                   Detroit ........
2. Indiana           b. Airplanes               Akron ........
3. Michigan          c. Flour milling           St. Paul ........
4. Minnesota         d. Meat packing            Chicago ........
5. New Jersey        e. Electrical equipment    Schenectady ........
6. New York          f. Publishing              Pittsburgh ........
7. Ohio              g. Rubber
8. Pennsylvania      h. Steel
9. Texas             i. Telephones




The matching form is frequently modified in various ways. In the 
first place, it is sometimes arranged so that each item in the list of 
alternatives may be used more than once as a response; this makes it 
impossible to answer correctly by a process of elimination. In the 
second place, it may be arranged so as to require the use of more 
than one alternative for some test items; this enables the testing of 
ability to recognize terms which share meanings in common with 
other terms. In the third place, the alternatives may be arranged 
so as to make it easy to find the number of the alternative if its 
meaning has already been decided upon; such arrangements may 
be alphabetical, chronological, or logical, or have other meaningful 
bases.

Frequently these modifications are combined with the compound 
matching principle; that is, the response alternatives and the test 
items are brought together, or matched, on more than one basis. 
These modifications, which Tyler (13 : 41-52) has discussed under 
the term “master list,” will be illustrated in the following pages. 

Advantages and Limitations of the Matching Type of Test Item — 
Writers have disagreed upon the general merits of the matching 
type of item for testing achievement of the higher mental processes.
Some consider it poorly adapted for testing understanding of and
ability to use relatively complex interpretative ideas while others
consider it adaptable to testing all phases of achievement including
the broader thought processes. Our own judgment must hinge upon
our conception of “higher mental processes.” If these are understood
to mean only operations in which a pupil employs explicitly elabo-
rated concepts, then the matching form must be considered unsuit-
able, since by its very nature it must always contain in one of its
sets a list of short names, dates, concept words, or other highly con-
densed symbols. On the other hand, if the higher mental processes
may be considered as permitting such condensed concepts to be
used (e.g., transcendentalism, mercantilism, etc.), the matching form
can be capable of eliciting complex thinking as aimed at by in-
structional objectives. Since, in our judgment, “relatively complex
interpretative ideas” must not of necessity be wordy or lengthy, we
may conclude that the matching form is adapted to the testing of
all levels of mental process.

The compactness of the matching form, derived from its use of
the same response alternatives for a whole group of questions, makes
it highly efficient in terms of space required and testing time per item.
It is peculiarly well suited for making a rapid survey of a specific 
aspect of a field of subject matter, such as its leading personalities, 
the time orientation of a group of events, or definitions of basic 
terms. 

Perhaps the chief disadvantage of the matching form is that it 
requires even greater caution and circumspection than other forms 
if it is not to be rendered invalid by irrelevant clues, implausible 
alternatives, and awkward arrangement. These dangers may be 
avoided, however, by close adherence to the rules and suggestions 
in the following paragraphs. 

Rules and Suggestions for Constructing Matching Items. 
(1) Homogeneity of response alternatives, and consequently homo- 
geneity of the items with which these alternatives are to be matched, 
constitute by far the most crucial aspect of a matching set. This 
homogeneity should be present in all respects except the one that 
coincides with the instructional objectives at which the set is aimed. 
For example, if the set is aimed at definitions and distinctions be- 
tween terms involved in the human respiratory system, only such 
terms should be included, not terms referring to such physiological 
groupings as the digestive, skeletal, or nervous systems. Example:


Faulty:

1. Auricle        1. Main respiratory organ ........
2. Lung           2. Major subdivision of windpipe ........
3. Trachea        3. Windpipe ........
4. Vagus          4. Breastbone ........
5. Bronchus       5. Cranial nerve for lung ........
6. Esophagus      6. Chamber of heart ........
7. Sternum

Improved:

1. Lung           1. Windpipe ........
2. Trachea        2. Subdivision of windpipe ........
3. Vagus          3. Respiratory nerve ........
4. Bronchus
5. Bronchiole




One test for homogeneity is whether it is possible to label each 
column of the matching set with one term which would accurately 
delimit its content. Perhaps the only way to determine whether the 
matching exercise really elicits the type of achievement desired is
to take each question and examine the way in which the correct 
response alternative will have to be selected. Can it be selected 
directly on the basis of real knowledge? Or, if the selection has 
to proceed by elimination of incorrect responses, will such elimina- 
tion be based upon the gross unfitness for matching, either logically, 
grammatically, or generically, of all but the correct alternative? In 
the latter case, where all but the correct response are grossly unfit to 
be matched with the question items, the matching exercise is not 
eliciting the type of achievement desired. 


(2) The chances of guessing the correct response may be reduced by:


(a) Increasing the number of response alternatives beyond 
the number of items 

(b) Permitting some of the response alternatives to be used 
more than once in the same matching set 

(c) Requiring more than one response alternative to be 
matched with some of the test items. 
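
Device (a) can be checked by brute force. The sketch below is a minimal illustration, assuming a pupil who knows nothing and assigns a different response alternative to each item purely at random; it enumerates every such assignment and averages the number of correct matches. Item i is arbitrarily keyed to alternative i, and the function name is hypothetical.

```python
from fractions import Fraction
from itertools import permutations


def expected_correct_by_guessing(num_items: int, num_alternatives: int) -> Fraction:
    """Average number of correct matches under blind one-to-one guessing.

    Each item receives a different alternative; item i is keyed to
    alternative i. Every possible assignment is enumerated, so keep the
    lists short (ten or twelve alternatives at most, as rule (6) advises).
    """
    if num_alternatives < num_items:
        raise ValueError("need at least as many alternatives as items")
    total_correct = 0
    assignments = 0
    for assignment in permutations(range(num_alternatives), num_items):
        total_correct += sum(
            1 for item, choice in enumerate(assignment) if item == choice
        )
        assignments += 1
    return Fraction(total_correct, assignments)


# Five items matched against five, then seven, response alternatives.
print(expected_correct_by_guessing(5, 5))
print(expected_correct_by_guessing(5, 7))
```

Adding two surplus alternatives lowers the chance score, and it also prevents the last item from being answered by simple elimination, which is the more important gain in practice.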


(3) The basis upon which the matching is to be made should
always be clearly indicated. This should be done both in the direc-
tions and in the headings of the columns of items to be matched.
Example:


Instructing the pupil merely to match these groups of items with-
out giving him some basis for doing so will give the teacher no
justification for considering some responses correct and others in-
correct.




Improved:

Fibers         Test to Identify Fibers
1. silk        1. Tears easily with a shrill sound and
2. wool           ends of yarn are even and curled. ........
3. rayon       2. Tears with difficulty, with dull sound;
4. cotton         leaves irregular edge. ........
5. linen       3. Tears with difficulty, with dull sound;
                  leaves ends of yarn straight. ........
               4. Tears easily with shrill sound, leaving
                  threads of uneven length. ........
               5. Tears with shrill sound, with threads
                  long and uneven in length. ........


In this illustration the basis for matching is clearly indicated, the 
pupil knows what is expected, and a definite criterion for deter-
mining correct responses has been set up. 

(4) Single words, short phrases, numbers, and other quickly ex- 
amined types of material should constitute one of the lists of items 
to be matched. Also it is preferable that this list of short terms be 
the list of response alternatives. This requirement, of course, is 
what makes the matching form less suitable for testing understand- 
ing of terms and definitions than the multiple-choice forms, which 
can require the pupil to select from among definitions rather than 
from among terms.

(5) The list of response alternatives should be arranged in some 
order that will enable the pupil to find the correct response quickly
once he has decided what he is looking for. Thus dates should be 
arranged chronologically and names should be arranged alpha- 
betically. 

(6) The number of response alternatives, while preferably greater
than the number of questions, should seldom be greater than ten or
twelve. Longer lists will require the pupil to spend too much of his 
time in selecting the correct response. The guessing factor is better 
reduced by increasing the homogeneity of response alternatives 
than by increasing their number. 

(7) A single page should contain all of the matching set so that 
the pupil will not be required to look for a response on a different 
page from that which contains the question. 




6. Analogies—Analogies constitute not a distinct form of test 
item, but rather a way of putting questions to which any of the other 
forms may be adapted. The pupil is presented with two terms 
whose relationship he must infer. A third term is then given for 
which he must either recall or recognize a fourth term whose rela- 
tionship to the third is the same as that which obtains between the 
first two. Examples:


Any of the other forms of short-answer test items may be used in
conjunction with analogies. The pupil may be required to furnish
the missing terms as in completion or simple-recall items. The
missing term may be selected from a group of changing alternatives
for each analogy, as in multiple-choice items. A list of analogies
may be presented with a list of response alternatives, as in match- 
ing items. Complete analogies may also be furnished, the pupil 
being required to make a judgment concerning the truth or falsity 
of the analogy, as in true-false items. 

Advantages and Disadvantages of the Analogies Form.—Intel-
ligence and mental abilities tests have made more use of analogies
than have achievement tests, probably because the artificiality of
the analogies form requires more general mental ability for a cor-
rect adaptation or response than specific achievement of instruc-
tional objectives. Analogies do, however, constitute a brief compact
way in which to put a question. If pupils can be made thoroughly
familiar with this form so that it loses its artificiality for them, it
can become an efficient testing device. The advantages in brevity
of the analogies form may be seen in the following examples:


Oxygen : O :: chlorine : Cl

instead of

What is the chemical symbol for chlorine?

Animals : oxygen :: plants : carbon dioxide

instead of

What gas is indispensable in the respiration of plants?




Rules and Suggestions for Constructing Analogies Items.—The 
precautions to be observed are the same as those necessary in con- 
structing the type of item to which the analogies form is applied.
That is, if analogies are simple-recall items, the relevant suggestions
for constructing simple-recall items should be considered. Perhaps 
the only rule unique to analogies is that the directions, especially 
for younger, less "test-wise" pupils, should be especially clear, ex- 
plicit, and complete. Practice exercises and close observation by 
the teacher are usually necessary to insure that all the pupils will 
understand what the analogies form requires of them. 

7. Rearrangement.—Rearrangement items require the pupil to 
put into some specified order a series of randomly presented mate- 
rial. The specified order may be of any kind, such as chronological, 
difficulty, importance, length, weight, logical order, etc. Examples: 

a. Chronological Order: 


Directions: Given below are groups of three events or men whose
numbers are to be written in chronological order in the space at the right
of the items.


1. were Presidents of the United States... (1) Wilson,
(2) Lincoln, (3) Washington. ........ 3, 2, 1
2. were religious groups who settled in America...
(1) Catholics, (2) Quakers, (3) Puritans. ........ 3, 1, 2

b. Logical Order: 


Directions: A greenhouse catches and holds much of the heat radiation 
from the sun. Place a cross (x) before the statements below which help 
to explain this phenomenon. (Some of the statements are false; some 
of them are true but do not apply.)¹


..... 1. The glass transmits the longer heat waves more readily than
the shorter heat waves.

..... 2. Objects inside the greenhouse, once warmed, radiate very
long heat waves.

..... 3. The heated glass radiates the shorter waves to objects within
the greenhouse.


¹ This example, and examples of the pied outline form, are included in M. W.
Richardson, and others, Manual of Examination Methods, Chicago: University of 
Chicago Book Store, Preliminary Edition, June, 1933. 




..... 4. The glass roof reflects the longer heat waves.

..... 5. The shorter heat waves from the sun are transmitted readily
through the glass.

..... 6. The glass roof absorbs long heat waves.


Now rearrange the pertinent statements above in the proper order to
give a thoroughly complete explanation. Use the numbers.


The pupil may also be required to rearrange items and statements 
in outline form with the proper headings, classifications, and 
subordinations.

Advantages and Limitations of the Outline Form.—More than in
most other types, the mental processes involved in rearrangement
items depend upon how the subject matter has been presented in
the classroom. If the proper order has been explicitly presented in
class, the test may measure only rote memory. Higher levels of
understanding are elicited only if the pupil is required to make the
rearrangement originally.

Rules and Suggestions for Scoring Rearrangement Tests—Apart 
from its scoring, the considerations in designing rearrangement 
tests are similar to those discussed in connection with the other 
forms of items. Although the discussion of scoring methods prop-
erly belongs under another heading, it will be given here for rear-
rangement items because it is an intrinsic part of this form.

Where the number of parts to be rearranged is small, say four or
less, it is sufficient to consider the response correct only if the re-
arrangement is correct for all parts, and to give one point for the
correct answer. Where more than four parts are to be arranged, a
practical scoring scheme is to summate the arithmetic differences
(i.e., differences without regard to sign) between the student’s order
and the correct order, and then subtract this sum from the sum of
the differences which results from the greatest possible variation
from the correct order. For example:


Correct order: 12345 
Worst possible answer: 54321 




The sum of the differences between the correct and the worst 
possible arrangement is 12. If a pupil’s arrangement is 3 1 2 4 5, the 
sum of the differences from the correct arrangement is 4. His score 
would then be 12 − 4, or 8.
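
The scoring scheme just described can be sketched in a few lines. The code below is an illustration rather than a prescribed procedure: it assumes the key and the pupil's answer are both given as the numbers 1 through n, one per position, and it takes the exact reversal of the key as the worst possible answer, as in the example above. The function names are hypothetical.

```python
def sum_of_differences(pupil_order, key_order):
    """Sum of position-by-position differences, without regard to sign."""
    if sorted(pupil_order) != sorted(key_order):
        raise ValueError("pupil and key must use the same set of numbers")
    return sum(abs(p - k) for p, k in zip(pupil_order, key_order))


def rearrangement_score(pupil_order, key_order):
    """Greatest possible sum of differences minus the pupil's sum."""
    n = len(key_order)
    # The worst possible answer reverses the keyed order (1 becomes n, etc.).
    worst_order = [n + 1 - k for k in key_order]
    greatest = sum_of_differences(worst_order, key_order)
    return greatest - sum_of_differences(pupil_order, key_order)


key = [1, 2, 3, 4, 5]
# The pupil's arrangement from the text: 3 1 2 4 5.
print(rearrangement_score([3, 1, 2, 4, 5], key))  # 12 - 4 = 8
```

A perfect arrangement earns the full 12 points under this scheme and a completely reversed one earns zero, so partial knowledge of the order receives partial credit.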

Sims (10) has presented a method of scoring rearrangement tests 
so as to allow for chance. His formula is: 


$$\text{Score} = n - \frac{3d}{n}$$

where n = the number of items to be rearranged
d = the sum of the differences (without regard to sign) between
the correct numbering and the pupil’s numbering of the items

For most economical use, Sims recommends that one should first
find the mean chance arrangement for the set being scored, $M_c$,
which equals

$$M_c = \frac{n^2 - 1}{3}$$

Then the even-numbered values between zero and $M_c$ are sub-
stituted in the scoring formula and a simple table showing the cor-
respondence between deviations (d's) and score is prepared. All
d's greater than $M_c$ are treated as chance scores and scored zero.
Scoring then involves finding the sum of the pupil's deviation from 
the key and reading the score from the table. 
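
The table-building step can be sketched as follows. Two readings are assumed here and should be verified against Sims (10): that the scoring formula is Score = n − 3d/n, and that the mean chance sum of differences for n items is (n² − 1)/3, which is the average sum of absolute differences over all random orderings. The function names are hypothetical, and exact fractions are used to avoid rounding.

```python
from fractions import Fraction


def sims_score_table(n: int) -> dict:
    """Scores for each even sum of differences d from zero up to the mean chance value.

    Assumes Score = n - 3d/n and a mean chance value of (n**2 - 1) / 3.
    """
    mean_chance = Fraction(n * n - 1, 3)
    table = {}
    d = 0
    while d <= mean_chance:
        table[d] = n - Fraction(3 * d, n)
        d += 2  # the sum of differences for a rearrangement is always even
    return table


def sims_score(d: int, n: int) -> Fraction:
    """Read the score from the table; larger sums are chance scores and earn zero."""
    return sims_score_table(n).get(d, Fraction(0))


# Table for a five-part rearrangement item.
for d, score in sims_score_table(5).items():
    print(d, score)
```

Once the table has been prepared for a given n, scoring a paper requires only summing the pupil's differences from the key and looking up the result.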

Pied outline tests may be scored by allowing one point for each 
part of the outline which is correctly placed, correct placement be- 
ing defined as “under the proper heading.” 


EVALUATION TECHNIQUES EXPLICITLY AIMED AT 
HIGHER MENTAL PROCESSES 


In the course of the Eight-Year Study of the Progressive Education 
Association (1) an evaluation staff under the direction of Ralph W.
Tyler constructed a series of evaluation devices aimed at various in- 
structional objectives not explicitly approached by most ordinary 
achievement tests. Among the tests constructed for the Eight-Year 
Study are the following: * 

Test 1.3. Application of Principles—Aspects of Thinking
Test 1.3b. Application of Principles in Science
Test 1.31. Application of Principles in Chemistry
Test 1.32. Application of Principles in Physics
Test 2.5. Interpretation of Data
Test 5.1. Problems Relating to Proof in Mathematics
Test 5.11. Application of Certain Principles of Logical Reasoning
Test 5.21. Nature of Proof

* Evaluation in the Eight-Year Study, published by the Progressive Education
Association, University of Chicago.

One distinguishing feature of these tests is their presentation of 
original problem material differing from that presented in class- 
rooms and textbooks. A second feature is their orientation to a single, 
somewhat specific instructional objective expressed primarily in 
terms of a mental process, such as “ability to apply principles,” 
rather than in terms of knowledge of subject matter, such as 
history or chemistry. A third characteristic is the relevancy to everyday
problems of the material contained in the test questions and
situations.

The steps involved in constructing a representative evaluation 
device of this type, a test of ability to apply principles, have been 
described by Raths (8). The first step is to decide upon the facts,
principles, and definitions which have been taught and whose
application is to be tested. The second is to set up a group of problems
involving in turn each of the facts, principles, and definitions.
The specific terms in which the problem is put should be new rather
than familiar to the pupils; these terms, or settings, should be
interesting and should represent important aspects of the life
experiences of pupils. The third step is to present two or more
plausible alternative solutions to each problem, from which the
pupil will select the one he considers most reasonable in the light
of his knowledge or most consistent with the facts given. The
fourth step is to formulate a series of reasons or statements from
which the pupil will choose those supporting his solution of the
problem. This series should include plausible statements supporting
each of the alternative solutions or conclusions.

Illustrative of the end product of this procedure is the following
problem from Test 1.3b, Application of Principles in Science. The
pupil is required to indicate whether he agrees with, is uncertain
about, or disagrees with a stated conclusion, and to select reasons
to explain his decision.




Problem II 


Two new electric irons (each 110 volts, 500 watts) have been used
an equal length of time. The heating efficiency of both irons has
decreased by approximately the same amount.


Directions: A. If you are uncertain about the truth or falsity of the
italicized statement, place a mark in the box on the answer sheet
under A.
B. If you think that the italicized statement is quite likely to be true,
place a mark in the box on the answer sheet under B.
C. If you disagree with the italicized statement, place a mark in the
box on the answer sheet under C.


Directions for Reasons: If you placed a mark under A, select from
the first ten reasons given below all those which help you to explain
thoroughly why you were uncertain and place a mark in Column A
opposite each of the reasons you decide to use.

If you placed a mark under B, select from reasons 11 through 24
all those which help you to explain thoroughly why you agreed with
the italicized statement and place a mark in Column B opposite each
of the reasons you decide to use.

If you placed a mark under C, select from reasons 11 through 24
all those which help you to explain thoroughly why you disagreed
with the italicized statement and place a mark in Column C opposite
each of the reasons you decide to use.


Reasons to be used if you are UNCERTAIN: 


1. I have never used electric irons enough to know whether they
consistently give more or less heat after they have been used for a time.
2. The heating efficiency of an iron may be affected by the voltage
constancy of the electric current maintained in the power line.
3. The frequent removal of the plug from an iron while it is being
used may result in more rapid destruction of the contacts.


7. The phrase "approximately the same amount" needs to be made
more definite.




9. I do not know what factors may affect the heating efficiency of an 
iron. 

Reasons to be used if you AGREE or DISAGREE: 

11. It is commonly observed that electrical appliances lose their efficiency
at a constant rate.

12. It is silly to think that the use of two irons for the same length of
time would not result in about equal amounts of deterioration.

13. A reduction in the flow of current through an iron is accompanied
by a reduction in the heat developed by the iron.

14. Irons that have been used for equal lengths of time will deteriorate
about equally.

15. The gradual oxidation of the wires in the heating element of irons
introduces additional resistance in the form of an insulating layer.

16. An electrical current flowing in an iron must spend some of its
energy in overcoming the additional resistance that nature gradually
develops to prevent its flow.


19. Just as the wires in an electric light bulb will in time become
burned and less efficient, so will the heating element in an electric
iron become worn with use.

20. Manufacturers of electric irons say that irons kept in good repair
will maintain their heating efficiency.

22. Burned wires decrease the heat developed in electric irons just as
decreasing the friction in automobile brakes develops less heat.


The pupil’s achievement on the test presenting eight such problems
is analyzed in terms of the extent to which he reaches valid
conclusions and also in terms of the kinds of reasons he selects to
explain his decisions about the stated conclusions. The reasons from
among which the pupils choose are classified into various types.

The classifications of the reasons quoted from the above problem
are as follows:

Reason 1 is based on Lack of Experience
Reason 2 is based on Control
Reason 3 is based on Control
Reason 7 is based on Definition of Term
Reason 9 is based on Lack of Knowledge
Reason 11 is based on Poor Practice
Reason 12 is based on Ridicule
Reason 13 is based on Right Principle
Reason 14 is based on Assuming the Conclusion
Reason 15 is based on Wrong Principle
Reason 16 is based on Teleology
Reason 19 is based on Analogy
Reason 20 is based on Authority
Reason 22 is based on Poor Analogy

These various types of reasons are considered representative of
those given by pupils in essay tests presenting similar problems.
The concrete meaning of each type of reason is probably best set
forth by the illustrative item given for it above. The pupil’s choice
of a conclusion and his supporting reasons can thus be analyzed into
the types of reasoning responsible for the correctness or incorrectness
of the choice. For each pupil’s test paper a set of scores is
obtained showing the percentages of desirable and undesirable choices
from among the total number of opportunities for choosing each
type of reasoning.

Ability to interpret data has been evaluated by tests constructed
according to the following procedure. Several sets of data, such as
statistics in a table, or graphs, maps, cartoons, and pictures, are
selected for interpretation by the pupils. Then for each set of data
a series of conclusions are drawn, some of which are fully supported
(true), partially supported (probably true), not supported
(uncertain), partially contradicted (probably false), and completely
contradicted (false) by the set of data. The pupil is required to
indicate for each conclusion the one of these five classes into which
it falls. His responses, or classifications of the suggested conclusions,
may then be interpreted so as to yield scores for general accuracy,
accuracy with probably true and probably false statements, accuracy
with uncertain statements, general tendency to go beyond the data,
tendency to ascribe more truth to statements than is warranted,
tendency to ascribe greater falsity than is warranted, and tendency
to make such crude errors in judgment as confusing probably true
with probably false or true with false statements. The following
problem is used by Raths (8 : 101-102) to illustrate a test of ability
to interpret data.




PERCENTAGE DISTRIBUTION OF GAINFULLY OCCUPIED
PERSONS IN THE UNITED STATES SIXTEEN YEARS OF AGE AND OVER,
1870-1930


Occupation Group

25.8    21.3
 2.7     2.0
30.5    28.6
18.0    20.7
 7.2     8.2
 8.     11.3
 1.6     1.4
 5.4     6.5


Directions: Assuming that the data in the above table are true, you
are to evaluate each of the following statements by writing after each
statement the number


1—if the statement is true 
2—if the statement is probably true 
3—if the evidence is not sufficient to indicate that there is any 
degree of truth or falsity in the statement 
4—if the statement is probably false 
5—if the statement is false 
Statements:


1. There has been a persistent decline in the percentage of gainfully
occupied persons in the United States, sixteen years of age and over,
in the field of agriculture from 1870 to 1930.
2. The number of ministers has increased from 1870 to 1930.
3. In 1916, the largest percentage of occupied persons was to be found
in manufacturing.
4. In Great Britain shifts in occupations of people gainfully employed
have taken place which are similar to those in the United States.
5. In 1930 the percentage of gainfully occupied persons was greater
than in 1850 for all occupations excepting agriculture.

[and so on for 11 more statements]
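
The kinds of scores obtainable from such a test can be sketched as follows. This is an illustration only: the answer key and the pupil's responses are hypothetical, and the exact definition given to each tendency is an assumption made for the sketch rather than the scoring procedure of the Eight-Year Study. Responses use the codes from the directions above, with 1 for true through 5 for false.

```python
def side(code):
    """-1 for the 'true' side (1, 2), 0 for uncertain (3), +1 for the 'false' side (4, 5)."""
    return (code > 3) - (code < 3)


def interpretation_scores(key, responses):
    """Compare a pupil's 1-5 classifications with the key (assumed definitions)."""
    if len(key) != len(responses):
        raise ValueError("key and responses must have the same length")
    pairs = list(zip(key, responses))

    def accuracy(codes):
        # Proportion of exact agreements among statements keyed with the given codes.
        subset = [(k, r) for k, r in pairs if k in codes]
        if not subset:
            return None
        return sum(k == r for k, r in subset) / len(subset)

    return {
        "general_accuracy": accuracy({1, 2, 3, 4, 5}),
        "accuracy_probable": accuracy({2, 4}),
        "accuracy_uncertain": accuracy({3}),
        # Assumed: claiming more certainty than the key, without changing sides.
        "beyond_data": sum(1 for k, r in pairs
                           if side(k) * side(r) >= 0 and abs(r - 3) > abs(k - 3)),
        # Assumed: any response nearer to "true" (or to "false") than the key.
        "more_truth": sum(1 for k, r in pairs if r < k),
        "more_falsity": sum(1 for k, r in pairs if r > k),
        # Assumed: key on one side of "uncertain", response on the other.
        "crude_errors": sum(1 for k, r in pairs if side(k) * side(r) < 0),
    }


hypothetical_key = [1, 2, 3, 4, 5, 3]
hypothetical_responses = [1, 1, 2, 2, 5, 3]
print(interpretation_scores(hypothetical_key, hypothetical_responses))
```

For this hypothetical pupil the general accuracy is one half, and all three of his errors lean toward ascribing more truth than the data warrant, one of them being a crude error.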




A further variation of these tests aimed at types of “thinking” is
the technique used to evaluate the pupil's understanding of the
nature of proof. A problem and conclusion are presented, followed
by a randomly arranged series of statements expressing facts,
implicit assumptions, restatements of the conclusion, irrelevant
statements, and inconsistent or contradictory statements. The pupil is
required to indicate the class in which each statement belongs, to
select three crucial assumptions of the argument, and to demonstrate
in other ways his ability to deal with various forms of argument.
His responses may be analyzed diagnostically into various total and
subscores in a fashion similar to that described for the test of ability
to interpret data. The following is a typical problem from Test
5.21, Nature of Proof.


The following advertisement appeared in a teachers’ association 
magazine: 

“The gentle, rhythmic chewing of gum helps increase the blood flow
to your head. This tends to make you feel more wide awake and hence
keener-minded. And, at the same time, sweet, pleasant-tasting chewing
gum supplies a quick pick-up of energy. That is why chewing gum
helps keep you alert at your work. There's a time and place for Chewing
Gum.” Mary, after reading this advertisement, decided that she would
be more alert in school if she chewed gum.


Statements: 


1. Chewing gum contains some ingredient which is a source of energy.

2. The chewing of gum helps keep the teeth in better condition.

3. It is quite possible for a person to be wide awake and yet not have
a keen mind.

4. The chewing of gum increases the flow of blood to the brain.

5. Chewing gum is socially acceptable.

6. An increase in the flow of blood to the brain results in one's being
more wide awake.

7. Many people enjoy chewing gum.

8. The source of energy in chewing gum can be quickly utilized.

9. People who are wide awake usually have keen minds.

10. The chewing of gum may actually decrease the amount of food and
oxygen reaching the brain.

11. Many good magazines advertise chewing gum.

12. Even if the chewing of gum did increase the flow of blood to the
brain, the increase would probably be too small to have any
effect.

13. Many school children chew gum.

14. People who are wide awake and keen-minded are usually alert.




The validity and reliability of the various tests of this type have
been reported in the volume summarizing the work of the Evaluation
Staff of the Eight-Year Study (11). In general, the reliability in
terms of the Kuder-Richardson formula (see pages 204-205) was
found to cluster around .90 for individual grades in various private
and public schools. The argument for the validity of the tests was
based, first, on the procedures used in selecting test material, and
second, on correlations between the test scores and essay tests of
pupils’ ability to write original interpretations of data and applications
of logical reasoning. It is evident from the illustrations
that the types of test items employed represent no radical departure
from the constant-alternative and changing-alternative types described
earlier in this chapter. Rather it is the nature of the problems
presented, their subject matter, and their more thoroughgoing
orientation to defined objectives other than information and factual
knowledge that constitutes the major change from more traditional
achievement tests.

No evidence has been presented concerning the extent to which 
the abilities approached by tests of this type are amenable to im- 
provement through teaching and instructional procedures or, on the 
other hand, are determined mainly by factors affecting a pupil’s 
mental growth independent of classroom instruction. As was noted 
in Chapter II, the evidence that the achievements evaluated by these 
tests are relatively more permanent, less “forgettable,” than the 
achievements evaluated by more traditional tests may mean that 
either (1) these tests evaluate achievements not of instructional
objectives but rather of the goals of general mental development or
(2) they evaluate achievements of instructional objectives which
represent the more fundamental and functional goals of educational
endeavor. Until further evidence is brought forth it will be impossible
to determine whether these tests of various “intangibles,” or “higher
mental processes,” approach aspects of pupils which can be acquired
and improved through instructional effort on the part of the teacher.


ARRANGING THE SHORT-ANSWER TEST


After the test items have been constructed, they must be arranged
in such a form as to maximize efficiency of administration and
interpretation. This involves attention to the following:




1. Assembling items into parts
2. Editing the items
3. Order of difficulty within parts
4. Arranging items for efficient scoring
5. Preparation of the scoring key
6. Providing directions for the pupil
7. Providing directions for the test administrator

1. Assembling Items into Parts—At the end of the process of 
constructing the proper items covering the whole table of specifica- 
tions, the teacher will be in possession of a collection of test items 
in varying forms. The number of items will be whatever is sufficient
to cover adequately the instructional objectives at which the test is 
aimed; obviously, therefore, the number will vary in accordance 
with the extensiveness of these objectives. In most cases it will be 
somewhere between fifty and two hundred. 

Two ways of assembling these varying types of items into a complete
test are possible: by types of test items (e.g., all the true-false
together, all the multiple-choice together, etc.) or by units of subject
matter or instructional objectives (e.g., all items on the Colonial
Period together, all items on the Revolutionary Period, etc.). The
first of these is probably preferable from the standpoint of the pupil 
taking the test in that he need not shift his method of working 
from item to item as would be necessary if a true-false item were 
followed by a multiple-choice item and then by a completion item. 
From the standpoint of interpretation, the second type is preferable 
only when sufficient items are included under each topic or objective 
to enable the teacher to make reliable interpretations of specific 
weaknesses and strengths in the pupil’s achievement. Since most 
achievement tests cannot be lengthy enough to provide sufficient 
items in each topic or specific objective for such diagnostic purposes,
part scores on subtests are usually not obtained; rather a
single score on the complete test only is obtained. Consequently,
in most cases it is preferable to assemble the items according to 
types; that is, all items of the same type are collected into one part. 
If at this stage it is found that there are too few items of any one 
type to constitute a special part in the test, they may be changed 
over into the most suitable form of those already used. 




2, 3. Editing the Items and Arranging Them in Order of Difficulty
Within Parts—These two steps are discussed together because
they may conveniently be carried out at the same time. In editing 
the items a final check should be made to see that all the rules and 
suggestions for constructing that form of item have been observed. 
If possible, the opinions of other qualified teachers should be ob- 
tained by having them read each item carefully and attempt to 
answer it. Whatever rewordings and reconstructions are indicated 
by this critical examination by outsiders should be carried out. 
During or after the editing, an attempt should be made to arrange 
the items of each type in order of difficulty. Since it has been found 
experimentally that subjective estimates of the difficulty of an item 
are seldom more than rough approximations of the difficulty as 
measured by the percentage of pupils failing an item, no attempt 
should be made to obtain more than a rough order of difficulty. 
That is, rather than the items being ranked one after the other, 
they should be classified into a few categories, such as “very 
difficult,” “fairly difficult,” “average,” “fairly easy,” "very easy." 

In this connection it should be noted that the items should range 
in difficulty from very easy to very difficult, with equal numbers of 
items at any point along this range. No item should be so easy 
that 100 per cent of the pupils succeed with it, nor should any be 
so difficult that no one succeeds. Because they do not discriminate
between good and poor pupils, such items can serve no purpose in 
evaluation except, perhaps, to emphasize and motivate achievement 
of certain objectives to be evaluated in future tests. If the difficulty 
of the item has been properly adjusted, the average score will be 
about half the possible score and the range of scores for the group 
tested will approximate the whole possible range, from almost zero 
to almost perfect. Perfect scores are undesirable because they in- 
dicate that the good pupil has been denied an opportunity to show 
the full extent of his achievement; similarly, zero scores fail to in- 
dicate what little achievement the poor pupil possesses. 
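
Once a test has been given, the difficulty of each item can be computed as the percentage of pupils failing it, and the items that every pupil passes or every pupil fails can be picked out. The following is a minimal sketch with hypothetical function names and a hypothetical class of four pupils; each pupil's paper is assumed to be recorded as a list of 1's (right) and 0's (wrong).

```python
def item_difficulty(papers):
    """Percentage of pupils failing each item (higher means more difficult)."""
    n_pupils = len(papers)
    n_items = len(papers[0])
    return [100.0 * sum(1 - paper[i] for paper in papers) / n_pupils
            for i in range(n_items)]


def non_discriminating(difficulties):
    """Items that everyone passes (0 per cent failing) or everyone fails (100 per cent)."""
    return [i for i, d in enumerate(difficulties) if d == 0.0 or d == 100.0]


def easy_to_hard(difficulties):
    """Item positions ordered from easiest to most difficult."""
    return sorted(range(len(difficulties)), key=lambda i: difficulties[i])


# Hypothetical results: four pupils, four items.
papers = [
    [1, 0, 1, 1],
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [0, 0, 1, 1],
]
difficulties = item_difficulty(papers)
print(difficulties)                      # [50.0, 100.0, 0.0, 25.0]
print(non_discriminating(difficulties))  # [1, 2]
print(easy_to_hard(difficulties))        # [2, 3, 0, 1]
```

In this example the second and third items would be revised or dropped, since neither separates one pupil from another.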

Within each type-group of test item, the order of presentation
should be from “easy” through “moderately difficult” to “most
difficult.” An order of increasing difficulty may be justified in terms
of the pupil’s distribution of working time and of his morale. It is




easily appreciated by the teacher that pupils, especially those of 
mediocre or low achievement, may become discouraged and con- 
fused if they are confronted with more difficult items at the be- 
ginning of the test. Time will be lost which could profitably be 
employed on the easier items. Furthermore, only the better pupils 
are discriminated or "spread out" from one another by the more 
difficult items. Weaker pupils coming upon difficult items at the 
end of the test will already have had an opportunity to indicate their 
full achievement; it will consequently not be adversely affected by 
discouragement and loss of working time. A high-jump competition 
in a track and field contest provides an analogous situation: com- 
petitors begin at low heights and work up to progressively higher 
hurdles, the less able dropping out along the way. 

The pupil morale factor may perhaps be the only justification for 
arrangement in the order of difficulty, in the light of findings by 
Capron (2). With 453 pupils in the Minneapolis fifth and eighth 
grades, she found that the arrangement of items from easy-to-hard, 
hard-to-easy, and random had no significant effects on performance 
in objective tests in spelling, arithmetic, and the fundamental 
arithmetic processes. 

4. Arranging Items for Easy Scoring.—The spaces for the pupil's 
responses should always be arranged in a straight line, usually in a 
vertical column running from top to bottom of the page. This ar- 
rangement can be followed for all types of test items. In simple- 
recall and completion items, where the omitted words or phrases are 
scattered in the middle of sentences or connected discourse, the 
scattering of these blanks may be circumvented by inserting a num- 
ber in them which refers to a specific space for the response at 
the right side of the page. Example: 


Items should be arranged in __1__ order of difficulty          1. Increasing
mainly in order to affect pupil __2__ in a desirable way.      2.


True-false items, multiple-choice items, matching items, and all the
others can be similarly arranged so that the spaces for pupil responses
will be in a vertical column at one side of the page.

5. Preparation of the Scoring Key—If one copy of a mimeographed
test is filled out with the correct responses it may be used
as a scoring key, the test scorer placing it adjacent to the pupil’s
responses and checking the latter according to their correctness or
incorrectness. Needless to say, the responses on the scoring key
should be checked and rechecked by the teacher and other subject-matter
experts before it is used in scoring test papers. Spaces should
be provided on the pupils’ test papers for recording the sums of
correct responses, incorrect responses, and omissions for each page
and each part of the test. Similar spaces should be provided on the
front page for total and part scores.
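
The bookkeeping described here, counting correct responses, incorrect responses, and omissions against a key, can be sketched as follows. The function name, the sample key, and the convention that an omission is recorded as a blank or as None are assumptions made for the illustration; responses are compared without regard to capitalization, which is likewise an assumption.

```python
def score_part(key, paper):
    """Count right, wrong, and omitted responses for one part of a test."""
    if len(key) != len(paper):
        raise ValueError("key and paper must have the same number of responses")
    counts = {"right": 0, "wrong": 0, "omitted": 0}
    for correct, given in zip(key, paper):
        if given is None or str(given).strip() == "":
            counts["omitted"] += 1      # blank space on the pupil's paper
        elif str(given).strip().lower() == str(correct).strip().lower():
            counts["right"] += 1
        else:
            counts["wrong"] += 1
    return counts


# Hypothetical key and paper for a four-item part.
key = ["T", "F", "b", "Increasing"]
paper = ["T", "T", None, "increasing"]
print(score_part(key, paper))   # {'right': 2, 'wrong': 1, 'omitted': 1}
```

Total scores for the whole test would then be obtained by adding the corresponding counts over the parts.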
6. Directions to the Pupil.—General directions to the pupil for
the entire test should appear on the front page with regard to
(1) writing his name and other identifying data in the proper
spaces, (2) when to begin to work, (3) amount of time to be allowed,
(4) observing and following directions for each part of the
test, (5) distribution of his time on easy and hard items, (6) whether
going back to preceding parts is permitted, (7) stopping work
when told, (8) asking questions, (9) guessing, and the penalties or
rewards for guessing. The teacher’s judgment is a fairly safe guide
in deciding the extent of these general directions. It is better, of
course, to err in the direction of excessively detailed directions
than to run the risk of omitting anything important.
Directions to the pupil for specific parts of the test should tell
him what each type of item gives him and what it requires of
him. The form of the directions is determined mainly by the
familiarity of the pupils with the type of test item being used and
their age. Brevity, simplicity, and completeness should be attained
in so far as possible. Wherever necessary, sample items already
answered should be provided to show the pupil how to proceed.
Similarly, practice exercises may be provided. If the test is a
time-limit test such exercises should be given outside the time limit
of the test itself. The pupil should gain from the directions a clear
idea of what he is to do and of where and how he is to record what
he has done.
7. Directions to the Test Administrator.—Whether the teacher
gives a test of his own or whether it is externally made, directions
for administering it should establish for him a clear idea of what to do
and what not to do. He should make whatever special provisions
are necessary for (1) furnishing pupils with pencils and other
equipment such as answer sheets, (2) passing out test booklets,
(3) reading directions with the pupils and answering questions
concerning them, (4) giving starting and stopping signals,
(5) observing total and part time limits, and (6) collecting test papers
and other equipment.


QUESTIONS 


1. Write several faulty-improved pairs of test items to illustrate each 
of the first six in the general rules on pages 148-149. Be sure that 
your items contain no faults other than those you are illustrating. 

2. Construct a table of specifications for a short-answer test based on 
this chapter. 

3. Compose a set of master list items, one list of which contains the
names of six types of short-answer items and the other a series of
twenty or more advantages and disadvantages of the various types.

4. Has your own test-taking or test-administering experience ever been
marked by irritation with the arrangement of the test with respect
to order of difficulty, method of indicating response, or directions
to pupil or administrator? In what ways?

5. Discuss the relationship between the nature of the achievement
tapped by any test item and such factors as teaching practices, learning
methods, and recency of learning.


REFERENCES 


1. Aikin, W. M., The Story of the Eight-Year Study, New York: Harper
& Brothers, 1942.

2. Capron, Virginia L., “The relative effect of three orders of arrangement
of items upon pupils’ scores in certain arithmetic and spelling
tests,” Journal of Educational Psychology, 24 : 687-695 (1933).

3. Cronbach, L. J., “An experimental comparison of the multiple true-false
and multiple-choice tests,” Journal of Educational Psychology,
32 : 533-543 (1941).

4. Kelley, V. H., “An experience with multiple-choice vocabulary tests
constructed by two different procedures,” Journal of Experimental
Education, 5 : 249-250 (1937).

5. Lee, J. M., and Segel, D., Testing Practices of High School Teachers,
Washington: U. S. Office of Education, Bulletin No. 9, 1936.

6. Lindquist, E. F., in Hawkes, H. E., and others, The Construction
and Use of Achievement Examinations, Boston: Houghton Mifflin
Company, 1936.

7. Odell, C. W., Traditional Examinations and New Type Tests, New
York: D. Appleton-Century Company, Inc., 1928.

8. Raths, L. E., “Techniques for test construction,” Educational Research
Bulletin, 17 : 83-114 (1938).

9. Richardson, M. W., and others, Manual of Examination Methods,
preliminary edition, Chicago: University of Chicago Book Store,
1933.

10. Sims, V. M., “Note on scoring the rearrangement test,” Journal of
Educational Psychology, 28 : 302-304 (1937).

11. Smith, E. R., Tyler, R. W., and the Evaluation Staff, Appraising
and Recording Student Progress, New York: Harper & Brothers,
1942, Vol. 3.

12. Stalnaker, J. M. and Ruth C., “Chance vs. selected distractors in a
vocabulary test,” Journal of Educational Psychology, 26 : 161-168
(1935).

13. Tyler, R. W., Constructing Achievement Tests, Columbus: Ohio
State University, 1934.

14. West, P. V., “A critical study of the right minus wrong method,”
Journal of Educational Research, 8 : 1-9 (1923).

15. Wiley, L. N., and Trimble, O. C., “The ordinary objective test as a
possible criterion of certain personality traits,” School and Society,
43 : 446-448 (1936).


CHAPTER X 


Choosing Standardized Tests


IN THIS CHAPTER WE SHALL BE CONCERNED WITH THE SIXTH QUESTION 
listed at the beginning of Chapter VIII: 

If an externally-made, standardized, short-answer test is used, 
how shall it be chosen from among the many available? 

It will be recalled from our discussion of the relative merits of 
standardized versus teacher-made tests that the chief advantages of 
standardized tests are their possession of norms and their greater 
technical refinement. On the other hand, it was concluded that
teacher-made tests usually coincide more closely with the instructional
objectives of particular teachers, yield greater benefits to the
teacher, and are more adaptable to continuous evaluation throughout
a semester. It was seen that for certain kinds of evaluations, for
certain instructional objectives, and for certain kinds of interpretations,
the standardized test is superior. The problem of maximizing
the usefulness of standardized tests was seen to resolve itself into
two subordinate questions: Under what circumstances should the
standardized test be used? How can the best standardized test for
a particular situation be selected from among the many available?

In general, standardized tests should be used when the instructional
objectives whose achievement the teacher desires to evaluate
are objectives for which such tests have been designed. The objectives
will usually be those that are common to a great many
classrooms, teachers, and school systems. For example, the objectives
of instruction in arithmetic, reading, writing, and other
tool subjects do not usually vary so much from classroom to classroom
and from school system to school system that standardized
tests designed for widespread use will be found grossly unfit for
use in a particular classroom. Similarly, in other subjects such as
American history or general science there is still enough in common 
between the objectives of different teachers and school systems so 
that valuable standardized achievement tests can be made. Granted 
the communality of instructional objectives among various class- 
rooms in specific subjects and the feasibility of using standardized 
tests in order to realize upon their advantages in norms and re- 
finement, the problem becomes one of selecting the proper standard- 
ized test. That is, the question of when such tests should be used 
involves the selection of the proper one to fit certain instructional 
objectives. This process in turn leads to a consideration of the 
criteria by which these tests should be judged and upon the basis 
of which one may be selected from among the many available. 
Certain criteria for the selection of a standardized test are ap- 
plicable to all evaluation devices including teacher-made short- 
answer tests, essay tests, product and behavior rating devices, and 
evaluation devices for all other aspects of pupils. Consequently, the 
present discussion will lay the general foundation for understanding 
the criteria by which all evaluation devices should be judged. In 
later chapters we shall consider the methods of applying these criteria 
to other specific evaluation devices. Here we shall take up the 
general nature of these criteria and the means by which they may 
be applied specifically to standardized achievement tests. 
The basic criteria for evaluation devices are (1) validity, (2) re-
liability, (3) administrability, and (4) interpretability. Let us now 
turn to the discussion of each of these criteria, their interrelation- 
ships, and how they may be applied in selecting a standardized 
achievement test. 


VALIDITY


Validity is the degree to which an evaluation device measures
what it purports to measure. Implicit in this definition are (1) the
specific nature of validity and (2) the quantitative nature of validity.
Validity is a specific concept in that it must always refer to a specific
purpose or objective and a specific group of pupils. Given a specific
group of pupils, an evaluation device or test will be valid for one
purpose but not for others. For example, a foot rule is valid for
measuring the length of a room but not for measuring its temperature;
a history test is valid for measuring achievement in history
but not for measuring achievement in French. Similarly, within a 
given set of instructional objectives, an evaluation device may be 
valid for evaluating achievement of one objective but not for 
others. Thus a history test may be valid for evaluating ability to
recall facts and definitions but invalid for evaluating ability to 
interpret data or to apply principles involving historical materials. 
That is, the concept of validity has meaning only in relation to 
specific purposes, subject matters, and instructional objectives. 

The specificity of the concept of validity holds also for the mate- 
rials or subjects to which it is applied. Here again a physical 
analogy is useful. A foot rule is invalid for measuring the distance 
between the stars although it is valid for measuring the distance 
between two edges of a table. Similarly, it is invalid for measuring 
the size of microscopic objects, such as bacteria or dust particles. 
In the same way a history test may be valid for distinguishing be-
tween levels of achievement in one group of pupils but not in others.
Tests in American history designed for elementary school pupils
may thus be invalid for high school pupils or college students. 
English tests for high school pupils may be invalid for evaluating 
the achievement of college students in a specific aspect of English. 
In short, an evaluation device is valid only in terms of a specific 
purpose to be carried out with a specific group. 

The quantitative nature of the concept of validity may be in-
ferred from the method by which validity is frequently measured.
The purpose of a test or other evaluating device is realized to the 
degree in which the results of the test correlate with data concerning 
the objective obtained by some method or criterion whose validity is 
already known or assumed. The validity of a foot rule may be 
defined as the degree to which results obtained with that foot rule 
agree with the results obtained with some standard yardstick, 
for example, in the U. S. Bureau of Standards. The validity of a 
history test similarly will vary in the degree to which scores obtained 
on it agree with other judgments, such as the teacher's ratings, of 
achievement in history. The standard used as the basis of comparison 
in determining validity is the criterion.

Obviously, validity is the most important characteristic of an
evaluating device. Unless the device possesses a satisfactory degree of
validity for a specific purpose with a specific group it is worthless,
no matter how well it satisfies the other requirements for evaluating 
devices. 

What kinds of criteria have been used in validating standardized 
achievement tests? Our answer to this question will, of course, also 
be useful to teachers in constructing their own achievement evalua- 
tion devices. The criteria which have been used may be divided 
into two interrelated classes: 

1. Criteria with which to compare test content
2. Criteria with which to compare test scores 

Criteria with which the content of a test may be compared may 
in turn take the following forms: (1) analyses of courses of study, 
(2) statements of instructional objectives, (3) analyses of textbooks, 
(4) analyses of teachers’ final examination questions, (5) pooled 
judgments of competent persons, (6) concepts of social utility, and 
(7) introspective logical or psychological analyses of mental proc-
esses.

Criteria with which scores obtained with an evaluating device 
may be compared are the following: (1) school marks, (2) increases 
in percentage of success in successive ages or grades, (3) differences 
in scores obtained by any two or more groups known to be widely 
separated in ability, (4) ratings of pupils by competent raters and
teachers, and (5) correlations with other tests. 

Criteria of the first type, those used for test content, have been 
called curricular criteria; validity obtained through the comparison 
of test scores with the second class of criteria has been termed
statistical validity.

By far the most important criterion of any evaluation device is its 
social utility, the degree to which it satisfies the aims of the educative
process as defined by the needs of society on the one hand and
the needs of the individual on the other. In Chapter I the major
function of evaluation was determined to be the provision of data
for guidance purposes. Guidance in turn has as its major function
the securing of an optimum fit between the individual and the
social order. Obviously, the validity of any evaluation device must
eventually be traced to and measured by the degree to which it
satisfies the interacting needs of individuals and the social order.
In Chapter II it was seen that the setting up of instructional ob-
jectives, which is mainly a curriculum problem, could be strongly
affected by the needs of evaluators for patterns by which to design 
their devices. That is, not only do curriculum construction and the 
formulation of instructional objectives determine directly what is 
to be evaluated and indirectly how the evaluation is to be made, 
but the limitations and possibilities of evaluation in turn react 
upon the content of statements of instructional objectives. The 
why, the what, and the how of evaluation are thus all intermingled, 
interdependent, and interacting. The major criterion of an evalua- 
tion device therefore finds its basic meaning in terms of social and 
individual needs. It is in terms of this criterion that not only the 
instructional process should be validated but also every phase of the 
evaluation process, from the total testing program through specific 
tests down to individual test items. Evidence from each of these 
steps in evaluation must be validated against the major aims of 
education, or educational objectives, as they are related to social and 
individual needs. 

On this ground, many evaluation projects can immediately be de- 
clared invalid by their lack of comprehensiveness, since they deal 
with only a few relatively minor educational objectives. It is of 
course the aim of the present book to increase the comprehensive- 
ness of evaluation by outlining techniques for the evaluation not 
only of pupils’ information, verbal and numerical skills, but also of 
their ways of thinking, attitudes, work habits and study skills, gen- 
eral and specific mental abilities, physical aspects, socio-economic 
background and environment. Evaluation will acquire increased 
validity in so far as it becomes more comprehensive in terms of 
these aspects of pupils. 

Validation of test content against the content of textbooks is useful 
because it insures to a high degree that pupils will be evaluated on 
the basis of what they have actually been taught. If test content so 
selected is then found wanting, the blame must be laid not upon the
test constructors but rather upon the author of the textbook and
those who have made wide use of it. Of course, the statement of
instructional objectives which is embodied and implicit in any 
textbook has a specific and definite form, whereas statements of 
objectives embodied in the usual course of study are far less de-
tailed, tending to deal more with general philosophy, aims, and
principles. Especially for subjects which are new to the curriculum
and for which there is not yet developed more than a general 
philosophy and general objectives, the textbook may embody the 
most useful statement of objectives available. The danger in the 
use of this method is, of course, that it is likely to cause a premature 
crystallization of content rather than stimulate the progress of 
instructional objectives toward greater social utility. Analyses of
courses of study and of teacher-made final examinations offer similar
advantages and disadvantages as criteria for the validation of 
achievement tests. 

Criteria for the validation of test scores rather than test content 
require the application of statistical correlation techniques. The test
must first be administered to the group of pupils for which the
validity of a given test is being determined. For each pupil who is
given a score on the test another criterion score is obtained. The
criterion score may be a teacher's mark or rating in the subject, a
score on a standardized achievement test whose validity is known or
assumed, a measure of the pupil’s ability in terms of either an in-
telligence test, an aptitude test, grade level, or chronological age,
or a combination of these and other criteria. Test scores and
criterion scores are then correlated by the appropriate statistical
techniques. The essential principle in all of these techniques is to
determine whether the average score on the test rises as the average
score on the criterion becomes higher. For example, the average
score on an arithmetic test of fifth-grade pupils should be higher
than that of fourth-grade pupils, because the latter have had less
instruction in arithmetic. If the expected difference in the average
test scores is not obtained, either the test is invalid or the arithmetic
instruction has not been effective. The better the agreement be-
tween criterion scores and test scores, the more valid the test.
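
The correlation of test scores with criterion scores described above can be sketched in a few lines of Python. This is only an illustration: the product-moment coefficient is assumed here as the "appropriate statistical technique," and the six pupils, their test scores, and their teacher ratings are hypothetical figures invented for the example.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists of at least two scores")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products and sums of squares of deviations from the means.
    cross = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return cross / sqrt(ss_x * ss_y)


# Hypothetical data: one test score and one teacher rating per pupil.
test_scores = [42, 55, 61, 48, 70, 66]
teacher_ratings = [2, 3, 4, 3, 5, 4]

validity_coefficient = pearson_r(test_scores, teacher_ratings)
print(round(validity_coefficient, 2))
```

The closer the coefficient lies to 1.00, the better the agreement between test and criterion for this particular group and purpose.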

Of these two general classes of test criteria, the one more fre-
quently used for standardized achievement tests is that involving
test content rather than test scores, for it is very difficult to secure
established, valid criterion scores in quantitative form for school
subjects. The teacher’s marks, while the most frequently used
criterion for test scores, are themselves far from perfectly valid
because they are influenced by many other factors than those which
should determine scores on achievement tests. Teachers base their
marks not only on achievement but also on attendance, classroom
behavior, recitations, work habits, general mental ability, and other 
aspects of pupils which may or may not be related to achievement 
test scores. Consequently, although the correlation of test scores 
against the teacher's marks is usually far from perfect, the validity 
of the tests cannot be considered imperfect to a similar degree. 

How can a teacher discover by what criteria a standardized 
achievement test has been validated and what degree of validity has 
been obtained between tests and criteria? Such information can 
usually be obtained from the manual for the test. Such manuals 
should contain information concerning the authors of the test, 
how the test items were chosen, what criteria were used in validating 
the test, and what degree of validity was obtained with these criteria.
The absence of such information from the manual should incline 
the teacher to judge the test as of doubtful validity. Of the thou- 
sands of standardized tests which have been published, only a small 
minority have been accompanied by manuals giving sufficient in- 
formation to enable the teacher to make an accurate judgment of 
the validity of the test. To supplement the information in these 
manuals, or as a substitute if there is no manual, teachers may 
consult various educational journals which report experimental 
studies of the test. These studies report not only data concerning 
test validities but also comparisons of tests in a given field with 
respect to their usefulness in various educational situations. 

Extremely valuable in this connection are the Mental Measure-
ments Yearbooks edited by Buros (1). In these volumes, of which
the 1938 and 1940 editions have been published, are presented critical 
evaluations by competent experts of large numbers of standardized 
tests in most of the subject-matter fields, and also other non-achieve-
ment aspects of pupils. These test reviews cover not only newly
published tests but also older, widely used standardized tests. Com- 
petent reviewers representing a wide variety of positions and points 
of view among both actual and potential test users, both curriculum
and teaching specialists and test technicians, are selected to make the 
criticisms of each test. A perusal of these yearbooks prior to selecting 
any standardized test will furnish teachers and others responsible 
for the selection of tests with exactly the type of data upon which 
an intelligent judgment should be based.




In the final analysis, however, the selection of any test should be 
based upon the teacher's own examination of its content and on 
his opinion of its fitness for his specific needs and purposes. The 
first-hand data obtained by this close inspection of tests and manuals 
may then be judged in accordance with the requirements of good 
tests as presented in this chapter. Not only validity but also reliabil- 
ity, administrability, and interpretability of the test may thus be 
judged. 

If time does not permit the teacher to order and receive specimen 
copies of tests from the test publishers or to subject a fairly large 
number of tests in a given field to the necessary critical appraisal, 
such an appraisal may nevertheless be made after the test has been 
used and it may considerably modify his interpretation of the re-
sults obtained with it. Thus, if a test has been found upon inspec-
tion to have important shortcomings with respect to the teacher’s
needs for validity, reliability, etc., or to neglect or overemphasize
certain instructional objectives, the interpretation of scores should
be appropriately modified. There is no substitute for the teacher’s
familiarity with whatever evaluation devices are used. And, of
course, one of the best aids to acquiring this familiarity is to take
the test before administering it to pupils.


RELIABILITY 


Reliability is the accuracy with which a test measures whatever it 
does measure. What a test measures may not be what it purports to
measure; but if it measures something accurately, then it is a reliable
test. That is, if a test is valid it must of necessity also be reliable,
whereas a reliable test may not be valid. For example, if a foot rule
is used to measure temperature it will give reliable, consistent re-
sults but will have no validity for that purpose; one’s personal
comfort although unreliable and inconsistent would be far more
valid for this purpose. Thus an unreliable measuring instrument if
correctly applied may be more valid than a reliable instrument ap-
plied to purposes for which it is unfit. Reliability is thus seen to be
included within the concept of validity; it is one aspect of validity,
necessary but not sufficient to it. In order to be valid a test must
have some degree of reliability, but the converse does not always
hold.




The methods of estimating reliability all involve some means of 
securing at least two measures with the same instrument or with 
different forms of the same instrument and determining the agree- 
ment between them. For example, if a foot rule is applied to meas- 
uring the length of a table ten times, the disagreement among the 
ten measurements will indicate the unreliability of the foot rule 
and its application. If the ten measurements of length agree perfectly 
with one another then the foot rule is perfectly reliable; if the 
length is first 50 inches, then 49 inches, then 51 inches, then 50 
inches, then 48 inches, and so forth, the unreliability can be stated 
in terms of the amount of variability among the different measures 
of length. Similarly, with evaluation devices in education, the 
closer the agreement between successive applications of the device, 
the greater its reliability; the greater the variability, the less the 
reliability. 
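
The foot-rule illustration can be put into figures with a short Python sketch. It uses only the five measurements actually quoted above (the text's "and so forth" is not filled in), and the choice of the standard deviation and the range as the measures of variability is an assumption made for the example.

```python
from statistics import mean, pstdev


def measurement_spread(values):
    """Return the mean, standard deviation, and range of repeated measures."""
    if len(values) < 2:
        raise ValueError("need at least two measurements")
    return mean(values), pstdev(values), max(values) - min(values)


# The five table lengths (in inches) quoted in the text.
lengths = [50, 49, 51, 50, 48]
average, spread, value_range = measurement_spread(lengths)

# A perfectly reliable foot rule would give identical readings,
# so its standard deviation and range would both be zero.
perfect = measurement_spread([50, 50, 50, 50, 50])
```

The larger the spread among successive readings, the less reliable the instrument and its application.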

The emphasis of the present discussion on accuracy of measure- 
ment as the crux of the concept of reliability follows that of Jackson 
and Ferguson (3 : 22-24). 

The reliability of tests may be estimated by the test-retest, the 
parallel form, or the split-test methods. 1. The test-retest method 
requires that the same evaluation device be applied more than 
once to the same group of pupils. The agreement between the scores 
obtained by the two or more applications of the same test is deter- 
mined by means of a correlation coefficient. This method, however, 
is seldom used because of the following disadvantages: repeating 
the test at too short an interval introduces the memory factor and 
tends to make the self-correlations of the test too high; on the 
other hand, repeating the test after a longer time interval permits 
such factors as growth, intervening learning, and unlearning to
come into play so as to lower the self-correlation below what it 
should be. 

2. The alternate-forms method of estimating reliability makes it 
possible to avoid the disadvantages of too short or too long a time 
interval between successive administrations of the evaluating device.
Two parallel forms of the device must be constructed so as to be 
as similar as possible in content, mental processes required, length, 
difficulty, and all other respects. One form of the test is given to the
pupils and then, as soon as possible, the other form is given. The
agreement between the two is again determined by means of a
correlation coefficient. If the agreement is high, it is possible to say 
that each form does an accurate job of measurement. 

The question may occur to the reader how two forms can be 
parallel and equivalent while differing in specific content or test 
items. The notion of equivalent forms may become clearer when it 
is compared with the process of sampling from any large popula- 
tion. Thus, when a medical technician examines a droplet of 
blood for the number of red corpuscles it contains, he may deter- 
mine the reliability of his count by examining another droplet from 
the same person. The two droplets are equivalent and yet distinct, 
in that no corpuscle of one is contained in the other. Yet the agree- 
ment between the two counts is usually found to be very close. 
Similarly, a test may be considered to be a sampling of test items 
from a large population of possible test items. Each sample of test
items or form of test will be equivalent to the others although not 
a single test item is common to both forms. If the two forms or 
samples are administered to a group of pupils, the scores on the 
two forms will correlate or agree in proportion to the equivalence 
of the two samples of test items. If the two samples are equivalent, 
it may be inferred that each one constitutes a reliable test. The 
alternate-forms method of determining reliability has the disad- 
vantage that two tests must be prepared even though one could 
fulfill all needs other than those of finding reliability, of testing 
absentees from the first test without enabling them to profit from 
coaching by pupils who have taken it, or of other unusual needs. 

3. The split-test method of estimating reliability avoids the dis- 
advantages of both the test-retest and the alternate-forms method 
and consequently is most frequently used. In this method the 
items of a single test are divided into two chance halves usually 
by pooling odd-numbered items for one score and even-numbered 
items for another score. The odd-even method of splitting usually 
makes the two scores obtained from a single testing reasonably 
equivalent in such respects as practice, fatigue, boredom, mental 
set, item difficulty, and pupil morale. After the test has been given, 
two scores are obtained for each pupil, one on the odd-numbered
and the other on the even-numbered items. The agreement between
these two scores on the same test as determined by a correlation co- 
efficient measures the reliability of the test. Since this reliability 
holds only for one-half the whole test, the reliability of the whole 
test must still be obtained. This is so because the reliability of tests 
varies with their length; hence the reliability of a half test is lower 
than that of the whole. The availability of a technique for estimating 
the reliability of a whole test from that of its halves enables us to 
overcome this disadvantage. This technique, the Spearman-Brown 
prophecy formula (4 :40-41), requires merely the substitution of 
the calculated reliability of the half test in the following equation, in 
order to estimate the reliability of the whole test: 


$$\text{Reliability of whole test} = \frac{2\,(\text{reliability of half test})}{1 + (\text{reliability of half test})}$$


The applicability of the Spearman-Brown formula depends upon 
the degree to which certain assumptions are met. The two halves 
of the test must be as equivalent as possible in difficulty, average
score, variability of scores, and type of items. The halves into which 
the whole test is broken up should, however, be random rather 
than chosen to meet these requirements. Although based on a priori 
mathematical reasoning, the formula has been found experimentally 
to give results in agreement with the actual reliabilities of whole 
tests; that is, predicted and obtained whole-test reliabilities have 
been found to be approximately the same. 

Kuder and Richardson (5) have presented a simpler method, 
based on a rational definition of equivalence, for the estimation of 
reliability coefficients which does not involve the necessity of splitting 
a test into halves, rescoring twice, and calculating a correlation 
coefficient. The only data required for their simpler method are the 
number of items in the test, the standard deviation of the test, and 
the arithmetic mean of the total scores on the test. Assuming that 
all the items in the test measure essentially a single ability (that is, 
“that the matrix of inter-item correlations has a rank of one”), that 
the correlations between the items are all equal, and that all items 
are of the same difficulty, they arrive at the following formula: 


$$r_{tt} = \frac{n}{n-1} \cdot \frac{\sigma_t^{2} - n\,\bar{p}\,\bar{q}}{\sigma_t^{2}}$$

where $n$ = number of items in the test
$\sigma_t$ = the standard deviation of the total test scores
$\bar{p} = \dfrac{\text{arithmetic mean of test scores}}{n} = \dfrac{M_t}{n}$
$\bar{q} = 1 - \bar{p}$


This formula underestimates the reliability of the test in the degree 
to which there is a variation in difficulty among the items; in any
case, its estimate is usually lower than that obtained by the split-half 
Spearman-Brown method. If the test items do not vary greatly in 
difficulty, it is probable that the quick estimate afforded by this 
formula is good enough for all practical purposes. 
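
A short Python sketch shows how little computation the Kuder-Richardson estimate requires. It assumes the formula takes the form commonly known as Kuder-Richardson formula 21, which uses exactly the three quantities named above; the function name and the illustrative figures (50 items, mean 30, standard deviation 8) are hypothetical.

```python
def kuder_richardson_21(n_items, mean_score, sd_score):
    """Estimate reliability from the item count, mean, and standard deviation."""
    if n_items < 2:
        raise ValueError("the test must contain at least two items")
    if sd_score <= 0:
        raise ValueError("the standard deviation must be positive")
    p = mean_score / n_items   # average proportion of items answered correctly
    q = 1 - p                  # average proportion answered incorrectly
    variance = sd_score ** 2
    return (n_items / (n_items - 1)) * (variance - n_items * p * q) / variance


# Hypothetical test: 50 items, mean score 30, standard deviation 8.
estimate = kuder_richardson_21(50, 30, 8)
print(round(estimate, 2))
```

No splitting, rescoring, or correlating is needed, which is the practical attraction of the method when the items do not vary greatly in difficulty.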

Factors Affecting Reliability—The length of the test has already 
been mentioned as a factor influencing its reliability. Certain other 
factors should also be discussed here in connection with the inter- 
pretation of a reliability coefficient. 1. The range of talent, achieve- 
ment, or ability of the pupils upon whom the reliability is based 
has a direct effect upon the reliability coefficient; that is, the greater 
the variability in the group of pupils, the higher the reliability 
coefficient. Consequently, the reliability coefficient of a test given 
to several grades is higher than that of the same test given to a 
single grade since the range of achievement is larger in the former 
case. For this reason, the reliability coefficient should always be 
determined upon a group of pupils whose range of achievement is 
similar to that of the group whom the test will be used to dis- 
criminate from one another. That is, the reliability of a test designed 
to reveal differences in achievement in a single classroom should be 
determined upon a group of pupils within a similarly restricted 
range of achievement. Reliability determined on pupils ranging 
among several classrooms or different geographic areas or differing 
in certain other factors affecting achievement will be spuriously 
high and give a false picture of the reliability of the test for use in 
the single classroom.
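
The effect of range of talent can be illustrated numerically. The sketch below rests on an assumption not stated in the text: that reliability may be expressed as one minus the ratio of error variance to obtained-score variance, with the error of measurement the same size in both groups. The standard deviations used are hypothetical.

```python
def reliability_from_spread(score_sd, error_sd):
    """Reliability as 1 - (error variance / obtained-score variance)."""
    if score_sd <= 0 or error_sd < 0 or error_sd > score_sd:
        raise ValueError("need 0 <= error_sd <= score_sd and score_sd > 0")
    return 1 - (error_sd ** 2) / (score_sd ** 2)


# Assumed: the same test, with the same error of measurement, in both groups.
ERROR_SD = 4.0

single_grade = reliability_from_spread(8.0, ERROR_SD)     # narrow range
several_grades = reliability_from_spread(12.0, ERROR_SD)  # wide range

print(round(single_grade, 2), round(several_grades, 2))
```

Under these assumptions the same test yields a coefficient of .75 in the single grade and about .89 in the combined grades, although its accuracy for any one pupil is unchanged.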

2. The conditions of test administration and scoring may raise or 
lower the reliability of a given test. (a) Thus, it has been shown by
Lindquist and Cook (6) that for a test of given length there is one
time-limit which gives maximum reliability; time limits shorter or 
longer than this optimum decrease the reliability of the test. 
(b) Similarly, the mental set of the pupils for accuracy, speed 
of work, motivation or incentive, and emotional stability also affects 
the reliability of their scores. (c) Distractions and accidents, like 
breaking a pencil or finding a defective test blank, lower reliability. 
(d) Cheating by pupils and inaccuracy in scoring due to clerical 
errors adversely affect reliability. These considerations should lead 
teachers to follow carefully the directions for administering the test 
and scoring it if its full potential validity and reliability are to be
realized. 

3. The construction of the test, the form of test items, their 
difficulty, the objectivity of scoring, and such factors as interde- 
pendent items and item wordings also affect reliability. 

Test reliability increases as the number of response alternatives 
presented by each item increases, up to certain limits. Thus, other 
factors being equal, a multiple-choice test containing five alterna- 
tives per item will be more reliable than a multiple-choice test 
containing three alternatives per item. Remmers and others (7, 8) 
have shown that this increase in reliability resulting from more 
alternatives can be predicted quite accurately by the Spearman- 
Brown formula. These considerations, however, belong more prop- 
erly under the heading of test construction than test evaluation, 
and have already been partly discussed in Chapter IX. Symonds (9) 
has presented a rather complete discussion of these and other factors 
influencing test reliability. 

The Interpretation of Reliability Coefficients —The magnitude of 
correlation coefficients may range anywhere from minus 1.00, through 
.00, to plus 1.00. If the correlation is perfect and positive, so that the
scores obtained on one form of a test have exactly the same relative
value as the scores obtained on another form, then the correlation
coefficient will be plus 1.00. If the scores on the two forms are com-
pletely unrelated, so that it is impossible to tell what one score will
be on the basis of the other, the correlation coefficient will be .00. A
perfect negative correlation, minus 1.00, means that the higher the
score on one form of the test, the lower the score on the other form
and in the same degree. Intermediate values of correlation, such as
plus .65, mean that the agreement between the scores on the two
forms is neither perfect nor completely absent, that there is a distinct
tendency for higher scores on one form to be associated with higher
scores on the other.

How high should the reliability coefficient be for standardized 
achievement tests used by classroom teachers? It is obvious that all 
evaluation devices are both invalid and unreliable to a certain degree. 
The question of validity being disregarded for the present, how un-
reliable may a test be and still be useful for evaluation purposes?
This depends mainly upon the fineness of the discrimination for 
which the test scores will be used. Frequently quoted as the mini- 
mum satisfactory reliabilities of tests to be used in a single school 
grade are the following figures of T. L. Kelley (4 : 28-29):

.50, for the measurement of general group (grade or school) ac-
complishments and an estimate of the probable future general group
success in school work.

.90, for the measurement of relative differences in achievement of
the group in two or more scholastic lines and an estimate of the
significance of such differences.

.94, for the measurement of the past general scholastic success and
the future promise of an individual in a specific school subject.

.98, for the measurement of differences in the individual in abilities
and accomplishments in several scholastic lines and an estimate of
the probability of persistence of differences, of the sort revealed, in
future school work or vocation.

Although frequently cited, these figures are probably too high in 
the case of individual measurements. It seems from the derivation 
presented by Kelley (4 :210-211) that he sets up too small a dif- 
ference in scores as the minimum on which teachers should be able 
to distinguish between different performances of a given pupil. 
(That is, his minimum reliability coefficients are based on the as- 
sumption, “It is ordinarily desirable to distinguish between two 
mean scores differing by as much as .26σ.”) If less rigid require-
ments for distinguishing between two scores made by the same in-
dividual are set up, the reliability coefficients of tests which are 
useful for distinctions between individuals and in individuals may 
be distinctly lower. It is probable, by this reasoning, that tests with 
reliabilities as low as .85 are of distinct value for individual diag-
nosis. Thus, in medical practice, tests of blood pressure or basal
metabolism are regularly used for individual diagnosis although
their reliability ranges from about bo to .90, which is no higher
than that of the usual standardized achievement tests.

It is sufficient, perhaps, to say that the teacher should seek a
standardized test whose reliability is as high as possible. But this 
reliability coefficient must be interpreted in the light of the groups 
of pupils upon which it is based, of the variability of this group, 
and of the methods used in determining reliability. One method of 
expressing the reliability of scores on a test which has the advantage 
of being independent of the range of talent used in determining 
reliability is the standard error of a true score. This statistic tells the
range within which scores on the same test would be expected to 
fall two-thirds of the time if a very large number of the tests, 
equivalent in all respects, were given to the pupil. Thus, if a pupil's 
true score on a test is 70 and the standard error of this true score is 
4, then his obtained scores would be expected to range from 66 to 74
two-thirds of the time if additional forms of the same test were 
given to him (2 :414-415). It is highly essential that the student 
become sensitive to the ever-present error in all measurements.
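
The range described in this paragraph can be computed directly. In the sketch below the band of one standard error on either side of the score reproduces the 66-to-74 example from the text. The second function is an added assumption: it uses the familiar relation between the standard error, the standard deviation of scores, and the reliability coefficient, which the text does not state, and the figures fed to it (standard deviation 10, reliability .84) are hypothetical.

```python
from math import sqrt


def two_thirds_band(score, standard_error):
    """Range expected to hold about two-thirds of a pupil's obtained scores."""
    return (score - standard_error, score + standard_error)


def standard_error_from_reliability(score_sd, reliability):
    """Assumed relation: standard error = SD * sqrt(1 - reliability)."""
    if not 0 <= reliability <= 1:
        raise ValueError("reliability must lie between 0 and 1")
    return score_sd * sqrt(1 - reliability)


# The example from the text: true score 70, standard error 4.
band = two_thirds_band(70, 4)

# Hypothetical test with standard deviation 10 and reliability .84.
error = standard_error_from_reliability(10, 0.84)
```

Reporting a score as a band rather than a single point keeps the ever-present error of measurement before the teacher's eyes.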

The sources of information concerning the reliability of standard-
ized achievement tests are the same as those for validity.


ADMINISTRABILITY 


The administrability of evaluation devices refers to the ease and 
accuracy with which the directions to pupils and evaluator can be 
carried out. The requirements of good directions have already been
listed in Chapter IX. Here we are concerned only with pointing out 
the need for concern with administrability in selecting an evaluation 
device, and with illustrating the variations in administrability which 
may occur. An example of an extremely valuable and reliable test 
which is difficult to administer is the Stanford-Binet Intelligence 
Test. This test can be given to only one individual at a time, and
requires approximately an hour for the individual and about an 
hour and a half for the examiner. The directions for the examiner 
are very detailed and complex; indeed, a university course of one 
semester is usually required for the training necessary for the proper 
administration of this test. In many tests the administration time is 




consequently interpretation will not be committed. In speed tests,
especially, slight errors in timing can make the norms and interpreta- 
tions completely worthless. 

Other tests require little supervision by the examiner; such “self- 
administering” tests require merely passing out test papers, giving 
the starting and stopping signals, and collecting the papers. The 


separated classrooms by examiners who differ greatly in testing
ability are obviously very desirable. A further aid is to separate the
directions to the pupils from the directions to the examiner, so
that the pupils will not be confused or distracted by material not
relevant to their needs. This separation should be made by placing
the directions to the examiner in a special manual containing
the material on validation, reliability, and interpretation.




For maximum administrability, the time required for giving a test should
preferably fit into the normal period of about forty
minutes; tests requiring more time than this can be made more 
reliable and valid by this additional length, but only at the sacrifice 


of administrability. Many test publishers and testing services have 

had to shorten their tests to meet 

shortened tests must be more highly refined in content so as to 
and 


in validity should be for monetary reasons, 
Administrability may be increased by the provision of answer 
sheets for pupils’ responses. These answer sheets are easier to handle




complexity of the operation, the provision of schemes for checking 
the scoring and summation of responses, may all throw valuable 
light on the test’s administrability. Obviously, evaluation devices 
whose cost of scoring is, say, one dollar apiece—as is true of one 
widely used vocational interest inventory—may be severely limited 
in usefulness for many teachers. 


INTERPRETABILITY 


The interpretability of an evaluation device refers to the ease and 
meaningfulness with which scores may be derived and understood. 
The first step in interpreting a test is obviously applying the scoring 
key to obtain “raw” scores. Some of the considerations in maximiz- 
ing scorability have already been discussed in Chapter IX with 
respect to teacher-made tests. Obviously scoring is easiest when it 
requires merely the counting of simple marks or numbers. It be- 
comes more effort-consuming when special ratings and correction 
formulae must be applied. For a large-scale testing program the 
problem of scoring may assume considerable proportions. Numerous 
mechanical scoring devices have been invented to lessen this labor. 
(These are more fully discussed in Chapter XX.) Here we are 
concerned only with the factor of scorability as a consideration in 
test selection. 

After the "raw" score has been obtained, it must be given meaning 
in relation to other pupils. Tables of norms are usually provided 
for this purpose. The nature of these norms and the groups upon 
which they are based must be considered in relation to the kinds of 
interpretation required. Some norms enable comparisons of pupils 
of the same and different ages, or grades, or other types of groupings. 
Norms may be provided for each part of a test so that separate 
scores may be interpreted. Here again we are concerned only with 
the need for considering interpretability in selecting evaluation de- 
vices. A fuller discussion of the interpretation of evaluation devices 
is presented in Chapter XXI. 


SUMMARY 


We reviewed the considerations affecting the choice between
standardized and teacher-made tests. Criteria in choosing a standardized
test are validity, reliability, administrability, and interpretability.
Validity is defined and its specificity and quantitative nature
are stressed. The major types of criteria, curricular and statistical,
are described and illustrated. The ultimate criterion, social utility,
is related to its more specific manifestations. Reliability is defined in
terms of accuracy of measurement. Methods of determining, factors
affecting, and standards for interpreting reliability are described.
Administrability is defined and criteria for its estimation are given.
Interpretability is discussed in terms of the ease of obtaining scores
and using the norms for a test.


QUESTIONS 


1. Since the standardized test seldom exactly fits the objectives of a
given classroom, how should achievement of the objectives peculiar 
to that class be evaluated? 

2. Can a test be valid that successfully measures the achievement of 
a certain objective, if the objective itself has no social utility? 

3. How can a teacher determine the curricular and statistical validity 
of his own tests? 

4. What are some practical procedures for increasing the reliability of 
a teacher-made test? 

5. Why is the mere examination of test content insufficient in determin- 
ing what a test actually measures? Give instances of possible dis- 
crepancies between “face validity," or the validity apparent in test 
content, and actual validity. 

6. Examine, if possible, the discussions in a Mental Measurements 
Yearbook, of tests in a given field. On what grounds are tests most 
often adversely criticized? Praised? Which of the four criteria of a 
standardized test is apparently most easily satisfied? 

7. Why are test manuals issued by the publishers often not an adequate 
source of information concerning a test? 

8. Do the four different methods of estimating test reliability yield 
results having the same meaning? Or would reliability coefficients 
of the same size for the same test based on the same pupils have 
different meanings if they were determined by the four different 
methods? 

9. Examine four or more different standardized tests in a given field,
obtaining them if necessary from their respective publishers. Examine
also the test manuals for details concerning the construction
of each test, its validity and reliability for different purposes and
groups, and the directions for administering, scoring, and interpreting.
Construct a table with separate columns for each of these aspects
of the tests and tabulate the information for each test in turn. Using
your table, choose the best test for your defined purposes and
defend your choice.

10. The statistical validity of a test can be no higher than the square
root of its reliability. Consequently, what is the highest possible
validity coefficient obtainable for tests whose reliability coefficients
are, respectively, .49, .64, .81, 1.00?

11. According to the Kuder-Richardson method, what is the reliability
of a test of 100 items, on which the mean score is 50 and the standard
deviation is 10? What is the reliability of a total test when the
correlation between its halves is .60?


REFERENCES


1. Buros, O. K. (ed.), The Mental Measurements Yearbooks, Highland
Park, New Jersey, 1938, 1940.

2. Guilford, J. P., Psychometric Methods, New York: McGraw-Hill
Book Company, Inc., 1936.

3. Jackson, R. W. B., and Ferguson, G. A., Studies in the Reliability of
Tests, Toronto: Department of Educational Research, University of
Toronto, Bulletin No. 12, 1941.

4. Kelley, T. L., Interpretation of Educational Measurements, Yonkers:
World Book Company, 1927.

5. Kuder, G. F., and Richardson, M. W., “The theory of the estimation
of test reliability,” Psychometrika, 2 : 151-160 (1937).

6. Lindquist, E. F., and Cook, W. W., “Experimental procedures in
test evaluation,” Journal of Experimental Education, 1 : 163-185
(1933).

7. Remmers, H. H., Karslake, Ruth, and Gage, N. L., “Reliability of
multiple-choice measuring instruments as a function of the Spearman-
Brown Prophecy Formula, I,” Journal of Educational Psychology,
31 : 583-590 (1940).

8. Remmers, H. H., and Sageser, H. W., “The reliability of multiple-
choice measuring instruments as a function of the Spearman-Brown
Prophecy Formula, V,” Journal of Educational Psychology, 32 : 445-
451 (1941).

9. Symonds, P. M., “Factors influencing test reliability,” Journal of
Educational Psychology, 19 : 73-87 (1928).




CHAPTER XI 


Product and Procedure Evaluation


AS WAS NOTED IN CHAPTER VIII, NOT ALL ACHIEVEMENT OF INSTRUCTIONAL 
objectives can be expressed in terms of language, in either words or
mathematical symbols. Whether the curriculum is organized into 
subject-matter fields or in activities, projects, or other more “pro- 
gressive” ways, teachers will find short-answer tests of the type thus 
far discussed inadequate for comprehensive, well-rounded evalua- 
tion of all the objectives of instruction. The inadequacy of these 
tests for such objectives as “coherent and cogent composition” and 
“organized expression,” already discussed, was found to point toward 
the need for essay tests. But there remains a large sphere of educa- 
tional activity which is still untouched by either of these kinds of 
evaluation devices. This is the area in which a pupil’s achievement 
is expressed by means of a product, something that is a direct in- 
dication of his application of information, skill, and understanding. 

The purpose of this chapter is to make clear the nature and logic 
of the methods for product evaluation. What we mean by a product 
is perhaps best illustrated in the fields of industrial arts and home 
economics. In industrial arts instruction the most direct expression 
of the pupil’s ability to perform the skills taught and his acquisition 
of technical knowledge is usually in the form of some project, or 
product, such as a piece of woodwork or metalwork, or a mechanical 
drawing. Whether a pupil can handle woodworking tools and under- 
stand their functions and operation is indicated better by the prod- 
ucts he manufactures with them than by any verbal test. Chairs, 
tables, lamps, funnels, dustpans, or sets of bookends and the pupil 
behaviors resulting in these products must consequently be evaluated, 
rather than his responses on a short-answer test or his essay on how 
to make any of these things.




Similarly, in home economics the pupil’s understanding of food 
preparation and clothing construction must be evaluated in terms 
of pupil products, as well as by means of tests involving the 
linguistic expression of ideas. The quality of the foods, such as 
pies, salads, or meats, and of the clothing, such as pajamas and 
dresses, and of the selection and arrangement of home furnishings 
are at least as important an indication of the pupil’s achievement in 
these fields as scores on a verbal test. 

In other subjects products may similarly constitute an important 
aspect of achievement. In English, notebooks, compositions, and 
term papers are valuable indicators of achievement. In the social 
studies such as history, civics, economics, and geography, such prod- 
ucts of pupil endeavor as maps, tables, notebooks, and term papers 
must frequently be evaluated. Art instruction which requires the
pupil to make drawings and models similarly requires the evalua- 
tion of products. Mathematics instruction is perhaps the least in 
need of such evaluation, but even here the drawings and diagrams 
required of pupils should be subjected to the type of evaluation now 
being discussed. 

In the natural sciences, such as biology, chemistry, and physics, 
product evaluation is required for laboratory setups, specimen 
preparations, models, notebooks, precipitates, and general pro- 
cedures. In the elementary grades, handwriting is an important pupil 
achievement well evaluated as a product. 

We are therefore ready to take up the seventh of the questions
listed at the beginning of Chapter VIII: 

If a non-language product or behavior device is used, how shall 
it be constructed? 


The Construction of Product Evaluation Devices


Products may be evaluated either (1) in terms of their component
features or desirable characteristics, or (2) in terms of their unitary 
“general merit,” in which features are not regarded separately. 
Devices of the first type are called score cards or check lists. De- 
vices of the second type, which may also be applied to the specific 
features involved in the first type, are either rating scales or quality 
scales. When rating scales or quality scales are applied to the specific
features of products, the evaluation device becomes a collection of
scales, each feature requiring a separate scale. Let us now turn to a
discussion of the general procedure involved in constructing each of
them.

In their general nature, product evaluation devices serve as a
means of systematizing and organizing judgments concerning the
product. Perhaps this point is best clarified in terms of analogies to
short-answer tests. In a short-answer test the pupil's “product” is
his set of responses to the test items and this “product” is evaluated
by means of the scoring key. The products with which we are here
concerned are, however, far more complex and detailed than responses
to short-answer test items. Consequently the evaluation
device, analogous to the scoring key, must be analytical in nature
so as to reduce the single complex product to a series of more simple
features analogous to test items. In other words, just as the short-answer
test by its construction enforces the analysis of a complex
achievement into a number of simpler, more unitary test item responses,
so the product evaluation device must furnish an analysis
of complex, multi-featured products into a number of more unitary
features each of which can then be separately considered and
evaluated. The total evaluation of the product is thus some function,
in practice usually the sum total, of the evaluations of the
separate features of the product.

Furthermore, just as each item in the short-answer test must be 
scored, usually either right or wrong, so each feature of the product 
must be given a score, usually, as we shall see, along a multi-valued 
continuum. The construction of a product evaluation device thus 
resolves itself into (1) the analysis of the product into specific 
features, and (2) the provision of various levels of quality for 
scoring each feature.

Before taking up each of these steps it may be well to point out 
explicitly the distinction between "tests" and "scales." The "test" is 
an evaluation device which both brings forth pupil behavior as 
responses to test questions and scores or evaluates that behavior.
The “scale,” on the other hand, is a set of samples or specimens of
pupil behavior or of its products, arranged in order of merit, difficulty,
or rarity, with which pupil behavior is to be compared. The
scale is more strictly a scoring device than is the test. 




Analysis of the Product into Specific Features.—The features
into which a product is analyzed should first of all reflect the 
instructional objectives at which the training in making the product 
is aimed. Here again it is seen that evaluation must proceed in close 
touch with instructional objectives. Products should not be evaluated 
in terms of features that are irrelevant to the skills and abilities 
which the product has been designed to require of pupils. For 
example, a product should not be evaluated on the basis of the 
feature “neatness” unless neatness has been a definite objective of 
instruction with respect to it. That is, there must be a definite and 
explicit relationship between the features on the basis of which the 
product is evaluated and the instructional objectives which constitute 
the reason why a pupil has made the product. 

A second consideration determining the features into which a 
product should be analyzed is the amenability of the feature to 
evaluation. Only such features should be considered as are suffi- 
ciently explicit, definite, and unambiguous so that competent judges 
will tend to agree in their evaluations of them. This point is il- 
lustrated by the results of attempts to construct rating scales for 
personality traits. It has been found that there is close agreement 
among judges on such traits as quickness, efficiency, perseverance, 
scholarship, and leadership, but that, on the other hand, they 
agree very poorly with one another on such traits as tactfulness, un- 
selfishness, courage, and integrity. It is obvious that for products 
whose features can be evaluated by physical measurements, as 
with foot rules or thermometers, there will be greater amenability 
to evaluation in the sense of greater agreement among judges than 
there will be for such relatively subjective features as “decorative 
value,” grace, or general merit. It should be realized, however, be- 
fore regarding a feature of a product as too intangible for evalua- 
tion, that many such features can be made scorable by means of 
techniques, such as the quality scale, to be discussed below. Further- 
more, the analysis into features must not concentrate on easily 
scorable properties at the expense of more intangible properties that 
may provide a more fundamentally important approach to the ob- 
jectives of instruction. Validity must not be sacrificed to scorability. 
In many instances the teacher will be required to make a fine
balance in the relative emphases placed upon the importance and
the scorability of features. Obviously, the best analysis of a product
into features requires insight into both the instructional objectives
and the requirements of valid, reliable measurement.

The features into which the product is analyzed should be 
grouped on some basis that will increase the efficiency of applying 
the evaluation device and provide a more meaningful judgment of 
the product. The features may be arranged in the order in which 
the product is examined, or the arrangement may be in terms of 
general and specific features. Newkirk and Greene (16 : 151-152) 
recommend that the features be grouped into classes according to 
the method of judgment to be used. Their examples of analyses 
of industrial arts products grouped according to method of judg- 
ment are as follows: 


Woodwork                        Drawing

Inspection                      Inspection
  Utility                         Neatness
  Design                          Placement
  Proportion                      Arrowheads
  Finish

Physical Measurement            Physical Measurement
  Squareness                      Circle
  Dimensions                      Accuracy
                                  Dimensions

Rating Scale or Inspection      Rating Scale or Inspection
  Nailing                         Lettering
  Screw joints                    Lines
  Glue joints                     Numbering
  Wood filing
  Sawed edges
  Plane edges
  Sanding


Another method of arranging the features, which is especially 
valuable when the product evaluation device is used also for 
instructional purposes, is in an order which is the same as that in 
which they emerge during the development of the product itself.
Some features are best judged before the product is completed because
they may be obscured or altered by subsequent operations or
parts of the product.




The analysis of a product into features should also, if possible, 
result in the weighting of each of the features according to its im- 
portance as a determiner of the total merit of the product. The 
total of the weightings may equal some such figure as 100, with each 
of the component features contributing a proportion or percentage 
of this total in accordance with its importance. An illustration of 
this from a somewhat different field is the Otis Score Card for rating
standardized tests in which, out of a total possible 100 points,
validity is given 20, reliability 10, the test manual 7, ease of administration
20, etc. In this way account is taken of the relative
importance of the various features of the product which is being 
evaluated. This consideration, however, leads us to the discussion 
of the next step in constructing a product evaluation device, since 
it is concerned more with the scoring of single features than with 
the analysis by which features are discovered.
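
The weighting idea may be sketched in Python as follows. This is only an illustration under stated assumptions: the feature names, the weights, and the convention of rating each feature as a fraction of full credit (0.0 to 1.0) are hypothetical and are not taken from the Otis Score Card, of which only part of the weighting is quoted above. The one requirement borrowed from the discussion is that the weights total 100.

```python
def weighted_total(weights, ratings):
    """Combine feature ratings (each 0.0 to 1.0) into a total out of 100.

    weights: feature -> points allotted (hypothetical values; must total 100)
    ratings: feature -> fraction of full credit earned on that feature
    """
    if sum(weights.values()) != 100:
        raise ValueError("weights should total 100")
    return sum(weights[feature] * ratings[feature] for feature in weights)


# Hypothetical score card for a piece of woodwork.
weights = {"design": 30, "dimensions": 25, "joints": 25, "finish": 20}
ratings = {"design": 1.0, "dimensions": 0.8, "joints": 0.5, "finish": 0.5}
print(weighted_total(weights, ratings))
```

A feature with a larger weight moves the total more for the same change in its rating, which is the sense in which the weighting reflects relative importance.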

But before proceeding, it may improve understanding of the 
analyzing process to call attention to its parallelism with earlier 
discussions in this volume, namely, the discussion of the statement 
of instructional objectives in Chapter II, of the table of test specifications
in Chapter X, and of the ultimate criterion of validity in
Chapter I. It should be realized throughout that the content of
evaluation depends upon the purposes or objectives of the endeavor
being evaluated.

Provision for Scoring Specific Features of a Product.—The prob- 
lem of scoring specific features may be considered to resolve itself 
into two related subsidiary problems: (1) the number of points or 
values which it is possible for the feature to have, and (2) the 
description and definition of the various points along the scale or 
continuum on which the feature is scored. Obviously, the simplest 
solution for both of these problems is twofold scoring, by which 
the feature can have one of only two possible values, such as 
"present or absent" or "good or bad." This converts the product
evaluation device into a simple check list, each of the features being 
checked according to whether it is present or absent in the product. 
The total score for the general merit of the product is the total 
number of desirable features which it possesses, as represented by 
the number of check marks. Although this method has the advan- 
tage of simplicity and ease of application, it violates the principle 
that most features are not completely present or absent but rather
are possessed by the product in degrees varying between these two
extremes. Consequently twofold scoring usually involves a considerable
loss of refinement, accuracy, and reliability in the evaluation.
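
Twofold scoring by check list may be sketched in Python as below. The feature names are hypothetical, and the function name is our own; the only rule taken from the text is that the total score is the number of desirable features checked as present.

```python
def check_list_score(checks):
    """Total score = number of desirable features marked present."""
    return sum(1 for present in checks.values() if present)


# Hypothetical check list for a piece of woodwork.
checks = {"square corners": True, "smooth finish": False, "tight joints": True}
print(check_list_score(checks))  # 2
```

Because each feature can contribute only 0 or 1, two products that differ only in degree on a feature receive the same score, which illustrates the loss of refinement noted above.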

This disadvantage is circumvented by the presentation of several 
levels of quality of the feature under consideration. The number 
of levels or steps or scale units or alternatives which should be 
presented for each feature depends on the fineness of discrimination 
possible for that feature. Thus features which can be objectively 
measured, such as accuracy of dimensions, may be evaluated on a 
large number of levels of quality, while more subjective features, 
such as neatness or legibility, should be evaluated on fewer levels. 
In general, the larger the number of levels or the finer the dis- 
crimination possible, the more reliable will be the resulting evalua- 
tion or measurement. Consequently the number of levels presented 
should be as large as amenability to discrimination permits, so as 
to maximize reliability. With human personality traits it has been 
found (21:79) that the reliability of rating reaches its maximum 
when about seven levels of discrimination are presented for each 
trait, more than seven yielding no increase in reliability and less 
than seven leading to an appreciable sacrifice of reliability. But if 
the various levels are defined by means of scaling techniques which 
assign scores to various products chosen to illustrate different levels 
of quality, more levels may be used. The Thorndike Handwriting 
Scale thus presents fifteen samples of handwriting, each sample 
defining some point along the range from poorest to best in “general 
merit.”

In what way should the various levels for each feature be pre- 
sented? In the case of twofold scoring, this question is answered 
by simply defining the two levels as "present or absent" or “good or 
bad." For three or more levels an intermediate term may be inserted 
such as “excellent, average, poor.” Each of these terms may be given 
a numerical value such as 3, 2, 1. 

A modification of this device is to furnish a graphic scale in the 
form of a straight line whose ends and intermediate points are 
properly labeled, as follows: 


|--------------------|--------------------|
Very poor         Average         Very good
1                    2                    3




In making his rating the judge places a check mark at the point on 
this line which, to his mind, represents the proper position in rela- 
tion to the merit of the feature. This mark is then given a score 
proportional to the linear distance from the “poor” end of the line, 
say the number of “half-inches” from that end. 
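
The conversion of a check mark into a score may be sketched in Python. The half-inch unit follows the example just given; the four-inch length of the line and the function name are assumptions made only for illustration.

```python
def graphic_scale_score(mark_inches, line_length_inches=4.0, unit_inches=0.5):
    """Score a check mark by its distance from the "poor" end of the line.

    mark_inches: position of the mark, measured from the poor end.
    line_length_inches: hypothetical length of the printed line.
    unit_inches: scoring unit (half an inch, as in the text's example).
    """
    if not 0 <= mark_inches <= line_length_inches:
        raise ValueError("mark lies off the line")
    return mark_inches / unit_inches


print(graphic_scale_score(2.75))  # 5.5
```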

Instead of general descriptive adjectives, the points along the 
graphic scale or the levels of quality may be defined in verbal terms 
specific to the feature concerned. The Minnesota Score Card for 
Meat Roast presents the levels for the different features in this way. 


FOOD SCORE CARDS
(Devised by Clara M. Brown and others)

Meat Roast

                    1                        2    3                      Score

APPEARANCE          1. Shriveled                  Plump and              1. ____
                                                  slightly moist
COLOR               2. Pale or burned             Well browned           2. ____
MOISTURE CONTENT    3. Dry                        Juicy                  3. ____
TENDERNESS          4. Tough                      Easily cut or pierced  4. ____
                                                  with fork
TASTE AND FLAVOR    5. Flat or too highly         Well seasoned          5. ____
                       seasoned
                    6. Raw, tasteless, or         Flavor developed       6. ____
                       burned


Applying the Psychophysical Methods.—If a product is one which 
must be frequently evaluated and will be used for many different 
classes over a period of years, as will such products as handwriting, 
term papers, specific food products, then it may be worth while to 
apply one of the psychophysical methods: paired comparisons, rank 
order, or equal-appearing intervals. Each of these somewhat dif- 
ficult and involved methods provides a means of defining the levels 
of quality for specific features of a product in such a way that
rather accurate psychological values for quality, or merit, may be
assigned to each product.

It should be kept in mind that each of these methods has most 
frequently been applied in scaling the general merit of various 
products taken as a whole, or in scaling single features of various 
stimuli, since each of them requires such labor that its application to 
more than one feature of a product, or to the scaling of all the 
features into which a product has been analyzed, is usually im- 
practicable. Thus, although the discussion of these methods here 
comes at a logical point in our treatment of the methods by which
the various levels of different features may be defined, it is probable 
that these methods will be more useful in connection with the 
construction of general quality or merit scales, or scales of affective 
(emotionally toned) values, in which a product or stimulus is con- 
sidered from one general point of view. Let us now briefly describe 
each of these methods and discuss its relative merits in various 
situations. 

1. The method of paired comparisons requires that each product 
be judged in turn as better or worse, with respect to the feature 
under consideration, than every other product in the group. The 
number of comparisons to be made increases rapidly as the number
of products, n, increases, because it is equal to $\frac{n(n-1)}{2}$. For
example, if there are 30 pupil products, the number of comparisons
by this method is $\frac{30 \times 29}{2}$, or 435. The large number of comparisons
required makes it wearying for the judges who do the comparing.
For this reason it is desirable to reduce to a minimum the number 
of pairs of products which must be compared. This can be done 
by either of two methods: (a) Select from all the products in the 
classroom a limited number to become the standard for the scale.
These should be chosen at what seem to be equal intervals along
the scale and the quality of their features should be as unambiguous
as possible. The method of paired comparisons may then be applied
to these standard values which are fewer in number than if all
the products of a given class were used. (b) Establish the approximate
rank order of all the products (either by the method of equal-appearing
intervals or any other rough method) and then obtain
paired comparisons only between neighboring pairs or with a
limited number of neighbors on either side of each product.
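
The number of comparisons under the full method, and the saving obtained by procedure (b), may be checked with a short Python sketch. The function names and the parameter `k` (the number of neighbors considered on either side of each product) are our own assumptions; the sketch counts pairs only and does not derive scale values.

```python
from itertools import combinations


def all_pairs(n):
    """Number of comparisons when every product meets every other: n(n-1)/2."""
    return n * (n - 1) // 2


def neighbor_pairs(rough_order, k=1):
    """Pairs restricted to the k nearest neighbors on either side in a rough ranking."""
    pairs = []
    last = len(rough_order) - 1
    for i, product in enumerate(rough_order):
        # Pair each product only with the next k products in the rough order,
        # so that every neighboring pair is listed exactly once.
        for j in range(i + 1, min(i + k, last) + 1):
            pairs.append((product, rough_order[j]))
    return pairs


print(all_pairs(30))                              # 435
print(len(list(combinations(range(30), 2))))      # 435, by direct enumeration
print(len(neighbor_pairs(list(range(30)), k=2)))  # 57
```

With 30 products and two neighbors on either side, the judges face 57 comparisons instead of 435.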

Several methods of treating the proportion of preferences for each 
product so as to obtain scale values have been proposed. We shall 
here refer to only the simplest one, that proposed by Guilford 
(8 :236-238). For details on the procedure of deriving scale values 
by this method we refer the reader to Guilford's volume.

The use of these scale values is perfectly straightforward. Any 
product which is being evaluated is compared or matched with the 
products whose scale values have been determined, with reference
to the feature under consideration. The product being evaluated is 
given the scale value or score of the product to which it is most 
similar with respect to that feature. 

The number of judges used in determining scale values should be 
as large as possible, never less than twenty. They should be qualified 
teachers and pupils who are highly familiar with the type of com- 
parisons required and with what is to be desired in any feature of 
the product. The best standard scales are based on the judgments 
of from 100 to 300 experts.

2. The method of rank order requires that the products be ranked
in serial order in accordance with the evaluator's judgments of their 
merit with respect to the feature under consideration. The mean 
rank for each product is computed for all judges. (The mean is the 
sum of the ranks given by all judges divided by the number of 
judges.) These mean rankings give merely the final rank order 
representing the pooled judgments of a number of evaluators and 
cannot be assumed to be scale values unless the products are about 
evenly scattered over the continuum of quality for the specific feature, 
with no piling up or grouping at any place. When the products 
are not so distributed but rather tend to pile up around the middle 
of the range of values so that more are average in merit and fewer 
are at the extreme good or the extreme bad end of the continuum, 
the average rank cannot stand for units on a scale. Guilford (8 :250- 
251) has presented a relatively simple method of deriving scale 
values from pooled judgments of rank order.
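
The pooling of ranks may be sketched in Python as follows. The three judges and their rankings are hypothetical, and the function name is our own. The sketch yields only the final rank order; it does not reproduce Guilford's method of deriving scale values.

```python
def mean_ranks(rankings):
    """rankings: one dict per judge mapping product -> rank (1 = best).

    Returns product -> mean rank over all judges.
    """
    products = rankings[0].keys()
    return {p: sum(r[p] for r in rankings) / len(rankings) for p in products}


# Hypothetical rankings of three products by three judges.
judges = [
    {"A": 1, "B": 2, "C": 3},
    {"A": 2, "B": 1, "C": 3},
    {"A": 1, "B": 3, "C": 2},
]
pooled = mean_ranks(judges)
final_order = sorted(pooled, key=pooled.get)
print(final_order)  # ['A', 'B', 'C']
```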

With the scale values thus derived, a number of products may be
chosen to represent equal steps along the continuum of quality for
a given feature of a product. The scale is then used by the teacher
in the same way as the scale derived by the method of paired comparisons,
that is, by matching the product to be evaluated with the
various products whose scale values have been derived.

3. The method of equal-appearing intervals requires the group 
of judges to sort the products into a group of piles, usually nine 
or eleven in number, so that the difference in apparent quality of 
the product with respect to the given feature is the same from one 
pile to the next. Then the number of times each product occurs in 
each pile for all judges is determined and the median score for that 
product is obtained. The median is the point below which lie half 
the cases. At the same time a measure of the spread of the piles 
into which each product was placed by the judges is obtained, 
usually in terms of the quartile deviation. (The quartile deviation
is half the difference between the pile value below which
75 per cent of the judges' placements of a product fall and the pile
value below which 25 per cent fall.) Scale values are then
determined for each product as the median pile number into 
which it was placed by all the judges. But if a high quartile devia- 
tion is found for a product, it is rejected as a definer of a scale 
value because this means that the judges were not able to agree 
well with one another on what scale value that product deserved. 
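
The computation for this method may be sketched in Python, using the standard `statistics` module (its `quantiles` function requires Python 3.8 or later). The pile placements, the function names, and the cutoff `max_q` for an acceptable quartile deviation are hypothetical; the text does not prescribe a particular cutoff, and other conventions for computing quartiles would give slightly different values.

```python
from statistics import median, quantiles


def scale_value_and_spread(piles):
    """piles: pile numbers (e.g. 1 to 11) given to one product by the judges.

    Returns (median pile, quartile deviation).
    """
    q1, _, q3 = quantiles(piles, n=4)
    return median(piles), (q3 - q1) / 2


def usable_products(placements, max_q):
    """Keep only products on which the judges agree; map them to scale values."""
    result = {}
    for product, piles in placements.items():
        value, q = scale_value_and_spread(piles)
        if q <= max_q:  # max_q is a hypothetical cutoff chosen by the scale builder
            result[product] = value
    return result


# Hypothetical placements of two products by seven judges.
placements = {
    "sample A": [3, 3, 4, 4, 4, 5, 5],    # judges agree closely
    "sample B": [1, 2, 4, 6, 8, 10, 11],  # judges disagree widely
}
print(usable_products(placements, max_q=1.5))
```

Here sample B would be rejected as a definer of a scale value because its placements are spread too widely, while sample A would be retained.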

This method requires that a large number of products and many 
raters be used in the first sorting of products, since some products 
will be rejected as having too wide a dispersion over the scale, and 
a large number of judges is necessary to make the results reliable. 
Also the preliminary samples used should cover the entire range of 
quality from very poor to very good. If a large number of sample 
products of widely varying quality are used in the preliminary 
sorting, much greater room for the selection of the most unambiguously
scaled products at evenly spaced intervals is afforded.
Thurstone (25) has used this method in the construction of attitude
scales so that the distance between one statement of an attitude and
the others may be readily determined and a person’s attitude defined
as the average of the scale values of the statements which he endorses.
Similarly, Hollingworth (10) classified jokes into ten categories
as to the degree of humor by this method and found the
procedure quicker, less monotonous, and less fatiguing than the 
ranking method. 
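
The scoring rule mentioned for attitude scales, a person's attitude taken as the average of the scale values of the statements he endorses, may be illustrated as follows. The statement labels and scale values are invented for the illustration, and the function name `attitude_score` is hypothetical.

```python
from statistics import mean

def attitude_score(scale_values, endorsed):
    """Average the scale values of the statements a person endorses."""
    return mean(scale_values[statement] for statement in endorsed)

# Invented scale values for three statements (not taken from any published scale).
values = {"s1": 2.0, "s2": 5.5, "s3": 9.0}
score = attitude_score(values, ["s1", "s2"])
```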

Comparison of the Three Methods—The present extremely 
simplified discussion has been only a superficial overview of these 
methods; for a more extended discussion the reader is referred to 
Guilford (8). The labor involved in any of them is so great as to 
warrant their application only to products and procedures which 
will need to be evaluated again and again. The greater frequency of 
use of the evaluation device will render the investment of effort 
worth while. 

The ranking method is probably easiest to apply but is not suitable 
when many products must be scaled. The paired comparisons tech- 
nique is perhaps the most accurate but is also the most laborious. 
The rating, or equal-appearing intervals, technique may thus in- 
corporate the best combination of applicability, validity, and prac- 
ticability. The rating scale method to which we have already referred 
proceeds, in assigning scale values to products, in accordance with 
what has here been treated as a method of equal-appearing intervals.
It thus appears that this method, while not so accurate or valid 
under many circumstances, still provides the only usable, practicable 
technique in certain instances. Consequently, we find ourselves re- 
turning to the use of rating scales as a method of defining by means 
of illustrative products the steps along the “parent” rating scale 
which must then be used to evaluate a large number of other 
products with respect to a defined, specific feature or to general 
merit “as a whole.” 


Specific Product Evaluation Devices


It will contribute to the reader's understanding of the preceding 
discussion to become familiar with some of the rating scales, score 
cards, check lists, and quality scales which have been developed in 
various fields, and with some of the results which have been ob- 
tained through their use. Let us begin with handwriting scales, since 
these come first both historically and in the educational ladder. Then 
will be discussed, in order, the evaluation of lettering, drawings, 
food products and procedures, clothing, industrial arts, compositions, 
and laboratory products and procedures.




Handwriting.—Thorndike's handwriting scale, developed in 1910, 
provided the first scientifically derived “means of measuring the 
quality of a sample of handwriting.” The condition which he 
sought to ameliorate is described as follows (23 : 1): 


At present we can do no better than estimate a handwriting as very 
bad, bad, good, very good, or extremely good, knowing only vaguely 
what we mean thereby, running the risk of shifting our standards with 
time, and only by chance meaning the same by a word as some other 
student means by it. We are in the condition in which students of 
temperature were before the discovery of the thermometer or any other 
scale for measuring temperature beyond the very hot, hot, warm, lukewarm,
and the like of subjective opinion.


Two methods were used by Thorndike in constructing his Scale
for Quality of Handwritings of Children in Grades 5 to 8: the
method of equal-appearing intervals, checked by the method of
paired comparisons. The two methods did not give results that
agreed exactly. He assigned to his samples of handwriting the values
7, 8, 9, 10, 11, 12, 13, 14, 15, 16, and 17; the sample valued at 14 is as
much better than 13 as 13 is than 12, as 12 is than 11, etc. Several
samples were furnished for some levels of quality, the samples being 
of equal merit but in different styles. The scale is used by putting 
the specimen of handwriting to be evaluated alongside the scale and 
determining to what point in the scale it is nearest. Thorndike was 
confident concerning the increase in reliability made possible by the 
scale: "Observers will disagree in their measurements made with the 
scale but not nearly so much as in measurements made without it" 
(23 : 44). Not all quality scales, however, have resulted in such
increased reliability of measurement as Thorndike prophesied. 

The Ayres Handwriting Scale (4 : 46-57) substituted “legibility” 
for Thorndike’s “general merit” as the criterion of quality, on the 
grounds that legibility is both more functional and more objectively 
scorable, as the time required for reading, than general merit.
One thousand five hundred seventy-eight preliminary samples
of handwriting were secured by asking pupils to write a list of
words thrown out of context so as to require every word to be
read. These samples were then read by ten competent paid assistants
who recorded, by means of a stop watch, the exact time it took to
read each sample. The average time required by the ten readers for


MEASURING SCALE 


“This scale for measuring the quality of handwriting is a revised edition of a scale
first published in 1912 and subsequently reprinted 12 times with several minor
revisions and with a total of 62,000 copies. The purpose of the changes introduced in
the present edition is to increase the reliability of measurements of handwriting
through standardizing methods of securing and scoring samples, and through
making numerous improvements in the scale itself designed to reduce variability in
the results secured through its use. The present scale may be referred to as the
‘Gettysburg Edition’ in order to distinguish it from other editions. The original or
‘Three Slant Edition’ and the scale for adult handwriting are not superseded by the
present scale. Copies of any of the three scales may be secured for ten cents each,
postpaid.


each sample was computed, and the rate in words per minute was found by
dividing the number of words in the sample by the average time in minutes.
To counteract increases in reading speed due to practice, the first
75 papers were reread at the end of the 1578 samples, and new times
were recorded. The samples were then classified into five types of
style: vertical, medium slant, extreme slant, backhand, and mixed. 
The scale values of the samples were obtained by arranging them 
in order of legibility, or speed with which read, and picking out 
those which fell at tenths of the whole series. That is, the sample 
below which 20 per cent of the samples fell was given the scale 
value 20, that below which 30 per cent fell was scaled at 30, and 
so on. The rates of reading of the eight samples thus chosen were 
131.2, 149.4, 163.5, 175.7, 186.1, 195.8, 202.9, and 209.6 words per 
minute, respectively. It is apparent that the differences in time rate 
become progressively smaller in proceeding from the worst to the 




FOR HANDWRITING 


“To secure samples of handwriting the teacher should write on the board the
first three sentences of Lincoln's Gettysburg Address and have the pupils read and
copy until familiar with it. They should then copy it, beginning at a given
signal and writing for precisely two minutes. They should write in ink on ruled
paper. . . .

“To score samples slide each specimen along the scale until a writing of the same
quality is found. The number at the top of the scale above this shows the value of
the writing being measured. Disregard differences in style, but try to find on the
scale the quality corresponding with that of the sample being scored. . . .”

(These portions of the scale are reproduced by permission of the Russell Sage 
Foundation, the publisher of the scale.) 


best sample, as would be expected from realizing that the gain in
legibility grows smaller and smaller as handwriting improves. The
scale illustrated on this and the preceding page is used in the same
way as the Thorndike Scale.
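
The two steps of the Ayres procedure, finding a reading rate for each sample and then picking the samples that fall at successive tenths of the series, may be sketched as follows. The sketch assumes that stop-watch times are recorded in seconds and that a sample's scale value is the percentage of samples read more slowly than it; the function names and the ten invented samples are hypothetical.

```python
def words_per_minute(word_count, times_seconds):
    """Reading rate from the readers' average time (times assumed to be in seconds)."""
    average_seconds = sum(times_seconds) / len(times_seconds)
    return word_count * 60 / average_seconds

def pick_scale_samples(rates, percents=(20, 30, 40, 50, 60, 70, 80, 90)):
    """Return {scale value: sample id}, choosing for each percentage the sample
    below which that percentage of the samples falls in order of legibility."""
    ordered = sorted(rates, key=rates.get)  # slowest-read (least legible) first
    n = len(ordered)
    chosen = {}
    for p in percents:
        index = min(n - 1, (p * n) // 100)  # number of samples lying below
        chosen[p] = ordered[index]
    return chosen

rate = words_per_minute(30, [12, 18])                  # two invented readings
samples = {f"s{i}": 100 + 10 * i for i in range(10)}   # ten invented rates
scale = pick_scale_samples(samples)
```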

Freeman (5) developed a Chart for Diagnosing Faults in Handwriting
which enabled the scoring of a specimen of handwriting on
five features: uniformity of slant, uniformity of alignment, quality of
line, letter formation, and spacing. For each of these features a
three-step product scale is provided.

Lettering.—The evaluation of free-hand lettering presents prob- 
lems similar to those of handwriting. One of the earliest attempts 
at a scale for measuring free-hand lettering was made by Rugg (20) 
in 1915. The judgments used in constructing his scale were a com- 
bination of (1) the “general impression” of uniformity of lettering, 
and (2) the number of correct heights, spaces, stems, and ovals in 



each of the lettering samples; the final grade of each sample was 
computed by means of the equation: 
$$A = k_1 H + k_2\,\mathit{Sp} + k_3\,\mathit{St} + k_4\,\mathit{Ov}$$
where $A$ = ability to letter as shown by a given sample;
$H$, $\mathit{Sp}$, $\mathit{St}$, $\mathit{Ov}$ = percentages of heights, spaces, stems, and ovals of
letters found to be correct in the whole sample;
$k_1$, $k_2$, $k_3$, $k_4$ = constants depending on the weights to be given to
the four elements in lettering (heights, spaces, stems, and ovals, respectively),
determined subjectively according to the effect of each element on the
“good appearance” of the lettering:
$k_1 = k_2 = .33$; $k_3 = k_4 = .17$.
By this laborious method, involving counting and judging the ac- 
curacy of more than 100,000 single strokes, with the necessary ac- 
companying tabulation and computation, Rugg obtained a scale of 
eight samples ranging by intervals of approximately 10 per cent
each, from 100 per cent to 30 per cent. 
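
A worked instance of Rugg's weighted sum may make the equation concrete. The weights below are the constants as read from the printed values (.33 for heights and spaces, .17 for stems and ovals) and should be treated as an assumption; the percentages for the sample and the function name `lettering_grade` are invented for illustration.

```python
# Weights as read from the source; an assumption, since the print is imperfect.
WEIGHTS = {"heights": 0.33, "spaces": 0.33, "stems": 0.17, "ovals": 0.17}

def lettering_grade(percent_correct, weights=WEIGHTS):
    """Weighted sum of the percentages of correct heights, spaces, stems, and ovals."""
    return sum(weights[element] * percent_correct[element] for element in weights)

# An invented sample: 90, 80, 70, and 60 per cent correct on the four elements.
grade = lettering_grade({"heights": 90, "spaces": 80, "stems": 70, "ovals": 60})
```

Here the grade comes to about 78.2, since 29.7 + 26.4 + 11.9 + 10.2 = 78.2.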

In 1928, Quackenbush (18) presented a hand-lettering scale con- 
structed by an approximation to the method of equal-appearing 
intervals. First, a large number of samples were collected. Second, 
these samples were sorted into tentative grades by several different 
individuals. Third, the whole group was sorted into vertical piles,
each pile containing all the samples which fell within the limits of
that pile, i.e., those coming between 70 and 79, and so on. Fourth,
these piles were examined to see if all the cases in each one were of 
the grade assigned to the pile. After elimination of doubtful samples, 
the remaining samples were arranged in a tentative scale with at 
least three samples at each step upon whose grade there was perfect 
agreement. Then further student samples were graded with the 
tentative scale by six instructors, and from the whole group were 
selected three samples of each grade upon which the instructors 
were unanimous. The students who had made the chosen samples 
were then requested to furnish an exact duplicate of the first sample. 
These samples, from two different students at each level, constituted
the final scale.

Miller (15) has used still another method in constructing a letter- 
ing scale. Ninety specimens were selected from the work of students 
and professionals as a basis for detailed study. Of these, eight were 




used as key samples. The remaining specimens were roughly classi- 
fied into several sets by merit, and each set was compared with one 
or more of the key samples. This technique was used rather than
that of ranking the specimens in order of merit to avoid the lack of
discrimination at the middle of the scale which occurs when a large
number of specimens must be ranked. By employing cross-ratings
Miller hoped to eliminate any bias caused by his particular grouping
of specimens. After many judgments had been obtained, the percentages
of times that any one specimen was preferred to a nearly
related sample were computed and then converted into scale values
obtained from the Fullerton-Cattell tables for the Perception of Small
Differences (6). It is clear that Miller's technique was a form of the 
method of paired comparisons. 

Free-hand Drawing.—To Thorndike (24) again belongs credit 
for the first product scale for drawing. He constructed it by securing 
rankings of fifteen children’s drawings, already selected for suit- 
ability, from 376 raters, 60 of whom were artists listed in Who's Who 
in America, 80 supervisors of art teaching, and 236 students of
education and psychology. The difference in merit between any two
drawings was derived from the percentage of judges who judged one
drawing superior to the other. Thus if 94.85 per cent judged B
superior to A, and 84.5 per cent judged C superior to B, then the
B − A difference is greater than the C − B difference. While the
exact relationship between “units of difference” and percentages of
preferences is not known, Thorndike decided to call the difference
judged similarly by 75 per cent of the judges (and otherwise by
25 per cent) equal to 1.00. A difference one-tenth as great, or 0.10,
would be judged similarly by 52.69 per cent, two-tenths as great by
55.36 per cent, etc. Likewise, given percentages of similar judgments
or preferences, the units of difference may be obtained. The necessary 
figures are given in Table 5. 

After the unit differences between successive pairs of products,
i.e., a − b, b − c, c − d, etc., are obtained, the differences are successively
added to the lowest product, which is considered as zero.
The successive sums are the scale values of the respective products.
It should be apparent that Thorndike here used a technique for
combining order-of-merit judgments.
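
The conversion from percentages of preference to units of difference, and the successive addition that yields scale values, may be sketched as below. The sketch assumes, as the figures quoted in the text suggest, that the conversion follows the normal curve, with the unit defined so that a 75 per cent preference equals 1.00; the printed tables may differ from it in the third decimal place. The function names are hypothetical.

```python
from itertools import accumulate
from statistics import NormalDist

_STANDARD = NormalDist()
_Z_AT_75 = _STANDARD.inv_cdf(0.75)  # normal deviate at 75 per cent, the unit of difference

def pe_units(percent_preferring):
    """Difference in merit, in units such that a 75 per cent preference equals 1.00.

    Assumes a normal-curve relation; percentages must lie strictly between 0 and 100.
    """
    return _STANDARD.inv_cdf(percent_preferring / 100) / _Z_AT_75

def scale_values(adjacent_percents):
    """Scale values from the percentages by which each product is preferred to
    the one just below it; the lowest product is placed at zero."""
    differences = [pe_units(p) for p in adjacent_percents]
    return list(accumulate(differences, initial=0.0))

values = scale_values([75, 75])  # three products, each preferred by 75 per cent
```

With these assumptions a 52.69 per cent preference comes out very close to 0.10 and a 55.36 per cent preference close to 0.20, in agreement with the figures given in the text.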

By this method Thorndike derived his Scale for the Merit of 



Table 5.—The Amounts of Difference (x − y) Corresponding to Given Percentages
of Judgments That x > y. %r = the Percentage of Judgments That x > y;
Δ/P.E. = x − y, in Multiples of the Difference Such That %r Is 75


%r   Δ/P.E.    %r   Δ/P.E.    %r   Δ/P.E.    %r   Δ/P.E.    %r      Δ/P.E.

50   .000      60   .376      70   .778      80   1.246     90      1.900
51   .037      61   ?         71   .8??      81   1.300     91      1.987
52   .074      62   .453      72   .865      82   1.355     92      2.083
53   ?         63   .49?      73   .909      83   1.4??     93      2.188
54   .149      64   .53?      74   .954      84   1.472     94      2.305
55   .186      65   .571      75   1.000     85   1.536     95      2.439
56   .224      66   .6??      76   1.046     86   1.601     96      2.596
57   .26?      67   .653      77   1.094     87   1.670     97      2.790
58   .299      68   .694      78   1.14?     88   1.74?     98      3.045
59   .337      69   .736      79   1.194     89   1.818     99      3.450
                                                            99.5    3.818
                                                            99.75   4.166



Drawings by Pupils 8 to 15 Years Old. It contained thirteen drawings
whose scale values were, in order, 2.4, 3.9, 5.7, 6.5, 7.8, 8.6, 10.5,
11.8, 12.6, 13.5, 14.4, 16.0, and 17.0. It should be remembered that 
the unit of these scale values is the difference between products 
whose direction 75 per cent of the judges agreed in perceiving. 

Another drawing scale is that by Kline and Carey (13). This scale 
differs from Thorndike’s in presenting four scaled series of drawings,
each on a separate subject: house, rabbit, tree, and boy running.
In this way the authors seek to overcome the disadvantage of Thorn- 
dike’s scale resulting from its dealing with a variety of subjects. 
This variety makes it difficult to determine the exact value of any 
drawing by comparing it with a standard on its own subject. Also 
the range of applicability of the Kline-Carey scale is greater,
extending from the primary kindergarten group to the high school
senior level.

In order to secure finer discrimination for the primary kinder- 
garten group and to realize upon the advantage of including several 
subjects for the drawings at each level of quality, McCarty (14) 
constructed another drawing scale. Three subjects were selected in 
accordance with evidence of children's major drawing interests: 
persons, houses, and trees. The subject of trees served as the basis for 




requiring the drawings to be original compositions. The pictures 
were to be judged not for their artistic feeling or aesthetic merit,
but rather for their accuracy, clarity, and vividness in representation. 

The 1070 or more drawings on each of the three subjects were 
sorted and sifted three times, and each time the drawings found too 
variable in rating or unnecessary for representation of their scale 
value were eliminated. The first sorting was done by the method of 
equal-appearing intervals; the second and third sortings were made 
by Thorndike’s process with the rank-order method. By the third 
sorting the number of drawings on each subject had been reduced 
to thirty-four. After the third rating by many judges, the three scales 
for drawings of human beings, houses, and trees contained 16, 16, 
and 12 drawings, respectively. 

Tiebout (26), requiring a drawing scale emphasizing aesthetic 
qualities rather than representation, constructed another device. 
Finer discrimination than that possible with the Kline-Carey scale 
was also desired, so that separate scales were constructed for each 
of the grades from one through seven. Also colors were used in the 
paintings, unlike other scales. As the result of a preliminary sorting 
of 2227 paintings by five experts, thirty were chosen for each grade 
so as to represent, in the judges’ opinions, five degrees of artistic
quality. Small groups—fourteen to seventeen—of highly qualified 
judges rated these thirty paintings by arranging them in each grade 
in five ranks of artistic merit, giving due consideration to the attain- 
ment of rhythm, balance, unity, and other aesthetically significant 
qualities. The method of equal-appearing intervals was then applied 
and nine-step scales for each grade were derived. It is notable that 
the reliability of the judgments of the paintings was found, in several 
ways, to be always above .90, indicating the high consistency of the
rating method and the judges.

Some evidence concerning the advantages accruing from the use 
of drawing scales has been obtained by Brooks (2). He found that 
the use of either the Thorndike or the Kline-Carey scales decreased 
the inaccuracy of ratings to approximately one-half of what it was 
when no scale was used. Furthermore, the Thorndike scale yielded 
more stable results than the Kline-Carey. 

Food Products and Procedures.—In this field the score card, 




check list, and rating scale have been more emphasized than quality 
scales, probably for the reason that food products are too perishable 
to be embodied in a permanent set of scaled samples. 

In the Check List for Student Performance in Dining-Room 
Waitress Service shown below, each feature is described verbally, but 
the levels for all features of performance are presented in one gen- 
eral form. 


CHECK LIST FOR STUDENT PERFORMANCE IN DINING- 
ROOM WAITRESS SERVICE 


(Adaption of form developed at Rochester Athenaeum and 
Mechanics Institute, Rochester, New York) 


Directions for rating: In the blank in front of each item write the
number which represents the level you think describes the person’s
performance.


Key

1—Usually below minimum standard acceptable in institution dining
room

2—Occasionally falls below minimum standard

3—Meets minimum standard, acceptable to patrons lacking in
discrimination

4—Meets maximum standard, acceptable to patrons desiring higher
grade service


Personal Appearance and Neatness 


____ Uniforms clean and well pressed
____ Hair clean, tidy, held in place with net or band
____ Shoes: low heels; clean; heels straight
____ Hose: no holes or runs; seams straight
____ Body: no odor
____ Cosmetics becomingly applied and used in moderation
____ Posture erect but not tense
____ Hands and fingernails clean and well manicured
____ Hands washed after using handkerchief, arranging hair, or
     visiting lavatory


Preparation Before Guests Arrive


____ Temperature, ventilation, and light in room satisfactory
____ Tables correctly and suitably set


Table Service


____ Loading of trays in kitchen done quickly and without
     inconveniencing others
____ Food service correct and rapid
____ Needs of guests recognized and cared for
____ Loading of trays with soiled dishes done quickly and quietly


Dismantling Table After Guests Leave


____ Returned food properly cared for
____ Soiled and wet linen properly cared for
____ Dining and serving tables cleared and in order


Salesmanship


____ Helpfulness
____ Courtesy
____ Diplomacy in meeting complaints and unusual situations


Clothing.—For clothing evaluation both quality scales and check 
lists have been developed. Murdoch in 1919 developed the first 
product evaluation device in any field of home economics. Her Sewing
Scale is described by Brown (3 : 443-444).

Other clothing evaluation devices are the Murdoch Analytic 
Sewing Scale for Measuring Separate Stitches, the Stiebling- 
Worcester Chart for Diagnosing Defects in Buttonholes, the Winn 
Analytic Sewing Scale, the Hickey Checklist for Fitted Facings on 
Garments, and the Score Card for Judging Garments of 4-H Cloth- 
ing Project I. All of these are discussed by Brown (3). 

Industrial Arts.—Newkirk and Greene (16) have presented a 
Rating Scale for Mechanical Drawings which contains an analysis 
into 45 features, each of which is verbally defined at one level and 
is scorable at ten levels. A suggestion concerning the use of quality
scales for judging such features as lettering figures and lines is
included in the Directions for Use. These authors also discuss in detail
the construction of quality scales by the order-of-merit method.

Compositions.—The first quality scale for compositions was developed
by Hillegas (9) in 1912. Similar in derivation and use to
the Thorndike handwriting scale, it was handicapped by its failure 
to provide separate scales for each of the four types of composition: 
narration, description, exposition, and argumentation. To remedy 




this defect, Ballou (1) presented in 1914 the Harvard-Newton
Composition Scales, consisting of four separate scales, one for each type
of composition at the eighth-grade level. Each scale contains six 
compositions actually written by eighth-grade pupils, ranging by 
approximately equal steps from the best to the poorest likely to be 
produced by eighth-graders, with letter grades assigned from A to F. 
Each sample is supplemented for practical purposes by a short 
description of its merits and defects, and is compared with neighboring
compositions on the scale. These scales are probably faulty, however,
in that the scale values were derived from percentage gradings
of each sample by teachers rather than by some method such as 
equal-appearing intervals. As a result, the values cannot be con- 
sidered to be expressed in equal units of merit, and the differences 
in merit between samples are difficult to interpret. 

In 1917, Trabue (27) published a supplement to the Hillegas 
scale designed to eliminate such objectionable aspects of the latter 
as its use of artificial samples, the extreme brevity of some samples, 
the unequal intervals of merit between samples, and the variations 
in type of composition and topic. The supplement, derived by the 
use of ratings with the Hillegas scale, presents ten compositions, 
seven on the subject, “What I Should Like to Do Next Saturday,” 
and three on other topics. The unit of quality, like that in Thorn- 
dike's drawing scale, is the difference in quality that was recognized 
by exactly 75 per cent of the original judges and not recognized by 
the other 25 per cent. 

Willing’s Scale for Written Composition in Grades IV to VIII 
(30) was based on the ranking of average ranks assigned by twelve 
judges to 63 samples. Eight samples were chosen from the 63 so as 
“to represent the extremes of this (assumed) normal distribution 
and six equally separated points within.” Consequently, the eight 
chosen ranks were: 1, 7, 14, 26, 40, 51, 59, and 63. These ranks repre- 
sent a compromise choice between proper placement in the dis- 
tribution and minimum average variation of the judges in ranking it. 
The scale includes only narrative compositions and takes account of
the correlation of rhetorical and formal qualities in the samples by 
including for each scale value the number of mistakes in spelling, 
punctuation, and syntax per 100 words. 

The Minnesota English Scales (29) attempted to provide separate 




measures of thought content, sentence and paragraph structure, and 
mechanical perfection. Three types of composition—exposition, nar- 
ration, and description—were also included, each type being on a 
fixed topic. Each sample of each type was judged three times by the 
judges, once for each of the three features. The Thorndike process
was applied to the resultant pooled rankings so as to obtain units of
difference in merit based on a 75 per cent preference as unity. By
using the three scales it was hoped that themes could be graded
for specific qualities as accurately as general merit could be graded.
However, the three features are not evaluated in equivalent terms in 
the same scale, but each feature of each scale furnishes practically 
an equivalent scale for the same feature in each of the other two 
types of discourse. Thus, a 52 in Structure is not equivalent to a 52 
in either Thought Content or Mechanics within the same type of 
discourse; but a 52 in Structure on any one of the scales is almost 
equivalent to a 52 in Structure on either of the other two. 

In practice this scale has been too confusing and difficult to ad- 
minister; teachers find the analysis of compositions for three separate 
scores a bewildering task. 

A critical discussion of the composition scales here mentioned as 
well as several others, up to the year 1923, has been given by 
Hudelson (11). Also many valuable considerations relevant to
composition teaching and measurement are pointed out. 

In 1925, Hudelson (12) reported on the reliability of teachers' 
ratings of compositions with and without the use of the Hudelson 
Typical Composition Scale. He had 157 judges rate the same number 
of themes. Reliability was measured in terms of the scatter of the 
ratings. The smaller the scatter, the greater the reliability of the 
ratings. The first time the scale was used, the scatter increased 
slightly, but successive applications increased the reliability of the 
ratings until at the fifth application it was much greater with the 
scale than without it. 

W. W. Theisen (22) compared the average variations in fif- 
teen teachers’ ratings of twelve specimens of English composition 
first according to the ordinary percentile system and later with the 
Nassau County Supplement to the Hillegas Scale. Without the scale
the average variation was 19.1 and with the scale it was 11.6. For only 
two of the twelve specimens was the average variation greater with 




the scale than without the scale and in one of these the discrepancy
was negligible.
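
A measure of this kind may be computed as follows. The sketch assumes that "average variation" means the average of the absolute deviations of the teachers' ratings from their mean, which is the usual sense of the term but is not spelled out in the text; the ratings and the function name are invented.

```python
from statistics import mean

def average_variation(ratings):
    """Mean absolute deviation of several ratings of one specimen from their mean."""
    center = mean(ratings)
    return mean(abs(rating - center) for rating in ratings)

# Invented percentage ratings of one composition by four teachers.
spread = average_variation([60, 70, 80, 90])
```

The smaller the figure, the more closely the teachers agree on the specimen.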

Ruch and Stoddard (19 : 128), however, found a definite loss in 
reliability with the use of scales, ratings with various scales yielding 
reliability coefficients averaging about .62, while the correlation between
the same two teachers’ ratings without scales was .80.
Similarly, Gordon (7) found coefficients of correlation of paired
raters with the Hillegas scale to be .46, the average coefficient for
ratings without the scale being .48. When the forty-one raters were
divided into two groups of twenty and twenty-one respectively and
the average ratings of these groups were correlated, coefficients of
.87 were obtained both without and with the scale.
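
The second of Gordon's computations, correlating the average ratings of two groups of raters rather than the ratings of two individuals, may be sketched as follows. The data layout, the function names, and the tiny invented ratings are assumptions made only for illustration; the product-moment coefficient is written out by hand so that the sketch depends on nothing beyond the standard library.

```python
from statistics import mean

def pearson(x, y):
    """Product-moment coefficient of correlation between two lists of scores."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def group_means(ratings_by_rater):
    """Average rating of each composition over all raters in one group.
    Each inner list holds one rater's ratings, in the same composition order."""
    return [mean(column) for column in zip(*ratings_by_rater)]

def pooled_reliability(group_a, group_b):
    """Correlation between the average ratings of two groups of raters."""
    return pearson(group_means(group_a), group_means(group_b))

# Invented ratings of three compositions by two small groups of raters.
group_a = [[1, 2, 3], [3, 4, 5]]
group_b = [[2, 4, 6]]
averages_a = group_means(group_a)
r = pooled_reliability(group_a, group_b)
```

Because the averages of many raters are steadier than any one rater's marks, a coefficient computed in this way will ordinarily run higher than the coefficient for a single pair of raters.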

It is apparent from these studies that substantial gains in the 
reliability of scoring compositions by means of quality scales are 
usually obtained only when the scales used have become thoroughly 
familiar to the raters who are using them. The scales must be studied, 
discussed, and practiced before they yield appreciable benefits. 

Laboratory Products and Procedures.—Illustrative of the applica- 
tion of the check list technique to products and procedures in the 
laboratory sciences is the method developed by Tyler (28 : 37-41) for 
determining the nature of students’ difficulties in using a microscope.
Students who are having difficulty in this respect are selected
from among others in a large class by first administering a group
test to the entire class seated at microscope tables in the laboratory.
The students are required to find an object under the microscope
within three minutes; by quickly passing from microscope to microscope,
the instructor is able to note those who are unable to pass
this preliminary test. The class is retested several days later. Students
who fail in both tests are then given the following individual test.

One student at a time is called into a special room by the instructor.
Here a microscope is placed on a table, together with yeast
culture, slide covers, cloth, and lens paper. The student is asked to
find a yeast cell under the microscope. A record of the time is
kept.


. . . but instead of writing down his actions the check list [see Table 6]
contains all of the actions both desirable and undesirable which have
been observed thus far. The observer records the sequence of actions by
placing the figure 1 after the description of the student's first action, a
figure 2 after his second, and so on. . . . By reading the actions in the
order they are numbered one gets a detailed description of the student's
procedure in finding an object under the microscope.


[Table 6. Check list of student actions in using the microscope. Entries
include: Takes slide; Wipes slide with lens paper; Wipes slide with cloth;
Wipes cover with finger; Adjusts cover with finger; Wipes off surplus fluid;
Places slide on stage; Looks through eyepiece with right eye; Looks through
eyepiece with left eye; Turns to objective of lowest power; Turns to
low-power objective.]
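
The recording scheme quoted here, in which order numbers are written against the listed actions and later read back in numerical order, may be sketched as follows. The function names are hypothetical, and the short observed sequence is invented, using a few of the action descriptions from the check list.

```python
def record_sequence(observed_actions):
    """Mark each action with the order numbers at which it was observed,
    as the observer does on the check list."""
    marks = {}
    for order, action in enumerate(observed_actions, start=1):
        marks.setdefault(action, []).append(order)
    return marks

def read_back(marks):
    """Recover the student's procedure by reading the actions in numbered order."""
    pairs = [(number, action) for action, numbers in marks.items() for number in numbers]
    return [action for _, action in sorted(pairs)]

# An invented observation; one action is repeated by the student.
observed = [
    "Takes slide",
    "Wipes slide with cloth",
    "Places slide on stage",
    "Wipes slide with cloth",
]
marks = record_sequence(observed)
procedure = read_back(marks)
```

An action that the student repeats simply receives more than one number, so that repetitions and false starts remain visible in the record.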

An examination of the sample will serve to indicate the usefulness of 




REFERENCES 


1. Ballou, F. W., Harvard-Newton Composition Scales, Cambridge:
Harvard University Press, 1914.

2. Brooks, F. D., “The relative accuracy of ratings assigned with and
without the use of drawing scales,” School and Society, 27 : 518-520
(1928).

3. Brown, Clara M., Evaluation and Investigation in Home Economics,
New York: F. S. Crofts & Co., 1941.

4. Chapman, J. C., and Rush, Grace P., The Scientific Measurement
of Classroom Products, New York: Silver, Burdett & Company,
1917.

5. Freeman, F. N., The Teaching of Handwriting, Boston: Houghton
Mifflin Company, 1914.

6. Fullerton, G. S., and Cattell, J. McK., “On the perception of small
differences,” Publications of the University of Pennsylvania,
Philosophical Series, No. 2, 1892.

7. Gordon, Kate, “A class experiment with the Hillegas Scale,”
Journal of Educational Psychology, 9 : 511-513 (1918).

8. Guilford, J. P., Psychometric Methods, New York: McGraw-Hill
Book Company, Inc., 1936.

9. Hillegas, M. B., A Scale for the Measurement of Quality in English
Composition, New York: Bureau of Publications, Teachers College,
Columbia University, 1912.

10. Hollingworth, H. L., “Judgment of the comic,” Psychological
Review, 18 : 132-156 (1911).

11. Hudelson, E., “English composition: its aims, methods, and
measurement,” Twenty-second Yearbook of the National Society for the
Study of Education, Part I, Bloomington, Ill.: Public School
Publishing Co., 1923.

12. Hudelson, E., “The effect of objective standards upon composition
teachers’ judgments,” Journal of Educational Research, 12 : 329-340
(1925).

13. Kline, L. W., and Carey, G. L., “A measuring scale for free-hand
drawing,” Johns Hopkins Studies in Education, No. 5, Part I, 1922;
Part II, 1933.

14. McCarty, Stella A., Children’s Drawings, Baltimore: The Williams
& Wilkins Company, 1924.

15. Miller, D. B., “The Gothic lettering scale for grading slanting
free-hand lettering,” Journal of Educational Research, 26 : 679-688
(1933).


w 


zx 


a y 


16. 


E 


18. 
19. 
20, 
ar 


22. 


23. 


30. 


Product and Procedure Evaluation 241 


Newkirk, L. V., and Greene, H. A., Tests and Measurements in 
Industrial Education, New York: John Wiley & Sons, Inc., 1935. 
Odell, C. W., “The use of scales for rating pupils’ answers to thought 
questions,” University of Illinois Bulletin, Bureau of Education Re- 
search, Bulletin No. 46, Urbana: University of Illinois, 1929. 
Quackenbush, G. M., “A scale for measuring the quality of hand- 
lettering,” Industrial Education Magazine, 30 : 229-231 ( 1928). 
Ruch, G. M., and Stoddard, G. D., Tests and Measurements in High 
School Instruction, Yonkers: World Book Company, 1927. 

Rugg, H. O., “A scale for measuring free-hand lettering,” Journal 
of Educational Psychology, 6 : 25-42 (1915). 

Symonds, P. M., Diagnosing Personality and Conduct, New York: 
D. Appleton-Century Company, Inc., 1931. X 

Theisen, W. W., “Improving teachers’ estimates of composition 
specimens with the aid of the Trabue Nassau County Scale,” School 
and Society, 7 : 143-150 (1918). 

Thorndike, E. L. “Handwriting,” Teachers College Record, 11 : 1- 
93 (1910). 


* Thorndike, E. L., “The measurement of achievement in drawing," 


Teachers College Record, 14 : 1-38 (1913). 


+ Thurstone, L. L., “Attitudes can be measured,” American Journal of 


Sociology, 33 : 529-554 (1928). 
Tiebout, Carolyn, “The measurement of quality in children's paint. 
ing by the scale method," Psychological Monographs, 48 : 85-94 


(1936). 


- Trabue, M. R., "Supplementing the Hillegas Scale," Teachers Col. 


lege Record, 18 : 51-84 (1917). 


- Tyler, R. W., "A test of skill in using a microscope," Constructing 


Achievement Tests, Columbus: Ohio State University, 1934. 


- Van Wagenen, M. J., "The Minnesota English Composition Scales; 


their derivation and validity,” Educational Administration and 
Supervision, 7 : 481-499 (1921). 

Willing, M. H., “The measurement of written composition in grades 
IV to VII,” English Journal, 7 : 193-202 (1918). 


CHAPTER XII


Essay Testing


IN THE PRECEDING CHAPTER WE DREW ATTENTION TO THE CONCEPT OF 
pupil product as an evaluation device and of the close relationship 
of essay test questions and answers to this concept. Both in their 
construction and scoring, essay tests should be considered amenable
to the same general rationale as was presented for product evalua- 
tion devices. The essay test is given separate treatment, however, 
because of the distinctive position it has heretofore occupied in the 
thinking and practice of evaluation workers. The general considera- 
tions in the preceding chapter have thus taken specific and dis- 
tinctive forms in the field of essay testing and should be explicitly 
set forth for the sake of clarity. 

The discussion of the relative merits of essay tests and short- 
answer tests in Chapter VIII came to the conclusion that for certain 
purposes essay tests are to be preferred. Let us briefly review the 
purposes and situations in which the essay test is necessary. In the 
first place, such courses as English composition and journalism have 
as one of their main objectives the ability to write essays which essay 
tests can elicit directly. The pupil’s knowledge of sentence structure, 
punctuation, and spelling constitutes an essential tool in essay 
writing and may be evaluated by short-answer tests, but the only 
sufficient and complete indication of his ability to express himself 
in an organized fashion is the essay test. Consequently the essay 
test must always be used for this purpose. 

Similarly, in advanced courses the ability to assimilate, organize, 
and evaluate critically large amounts of subject matter is usually a 
most important objective. This is, of course, especially true in the 
social studies. Whatever other short-answer devices may be used in
such a form as to get at objectives similar to these, they must always
fail to elicit the group of mental processes which is involved in 
writing answers to essay, or thought, questions. It is true that many 
essay questions have been of such nature that short-answer testing 
would have been more suitable for the purpose. But at its best, the 
essay or thought question elicits mental processes which are difficult 
to approach by other means and are extremely desirable objectives 
in many educational situations. Furthermore, as we have seen, the 
essay test possesses distinct superiority as a device for motivating 
pupils’ learning in such a way that they will be able to recall ma- 
terial in an organized fashion and to know facts when cues are not 
given. 

Thus, perhaps the first step in improving the essay test is an 
appreciation of where it should and should not be used. It should not 
be used when the short-answer test elicits similar mental processes 
and yet possesses far greater superiority in objectivity of scoring. 
Recall of information, interpretation of data, or application of 
principles may often be better evaluated by means of a short-answer 
test than by an essay test. Further illustrations of where the essay 
test should not be used will emerge in our discussion of its con- 
struction. 


The Construction of Essay Tests


Once it has been decided to restrict the use of the essay test to the 
evaluation of the type of achievement for which it is best suited, we 
are faced with the task of defining this achievement in terms of 
mental processes or of ways in which subject matter may be handled. 
The operation of a mental process assumes the presence of some 
material or subject matter to be processed or handled by the pupil. 
That is, a pupil cannot subject material, facts, or information to any 
mental process unless he possesses or is familiar with that material. 
To ascertain whether he possesses the knowledge or “raw material” 
without which mental processes as here understood are impossible, 
it is usually sufficient to employ the short-answer type of test. This 
does not mean that short-answer or “objective” tests must be
restricted to the testing of “mere” factual information but rather that
they are more suited than the essay test for this function. The essay
test has long been used in a confused way for the simultaneous
testing of both factual information and the ability to put this
information through various mental processes. This failure to restrict its
use to the purposes for which it is best suited has resulted in great
loss in testing efficiency, time, and effort, to say nothing of the low
reliability and validity of evaluation resulting from this confusion
of the functions of the two types of tests.

One reason for the failure to use the essay test only where it is
most suitable has been the unfamiliarity of teachers with the
short-answer form of test item. Another is that teachers have not been
sufficiently conscious of the large variety of types of essay questions
which can be asked and of the need for framing questions in such
a way as to elicit the kinds of mental processes which constitute
instructional objectives. This latter reason may perhaps be largely
eliminated if teachers will become familiar with such lists of types
of thought question as that presented by Monroe and Carter (2).
These authors listed twenty types of thought questions. Although
this list cannot claim to be exhaustive or composed of mutually
exclusive items or based on experimental research, it should prove
helpful to teachers wishing to realize the full range of potentialities
of the thought question. It must be remembered that not only the
question determines the mental processes required or elicited, but also
the way in which the subject matter has been taught, the
methods employed by the student, etc. The same question may
require “thought” from one student, memory from a second, and
other mental processes from a third. The thought question for a
given student today may be a memory question for the same student
tomorrow. Following is the list of types of thought questions given
by Monroe and Carter, together with illustrations of each,
drawn wherever possible from these authors:

1. Selective recall—basis given.

Name the Presidents of the United States who had been in
military life before their election.
What do New Zealand and Australia sell in Europe that may
interfere with our market?

2. Evaluating recall—basis given.

Which do you consider the three most important American
inventions in the nineteenth century from the standpoint of expansion
and growth of transportation?
Name the three statesmen who have had the greatest influence on
economic legislation in the United States.

3. Comparison of two things—on a single designated basis.

Compare Eliot and Thackeray in ability in character delineation.
Compare the armies of the North and South in the Civil War as
to leadership.

4. Comparison of two things—in general.

Compare the early settlers of the Massachusetts colony with those
of the Virginia colony.
Contrast the life of Silas Marner in Raveloe with his life in
Lantern Yard.

5. Decision—for or against.

Whom do you admire more, Washington or Lincoln?
In which in your opinion can you do better, oral or written
examinations?

6. Causes or effects.

Why has the Senate become a much more powerful body than
the House of Representatives?
What caused Silas Marner to change from what he was in Lantern
Yard to what he was in Raveloe?

7. Explanation of the use or exact meaning of some phrase or statement
in a passage.

Tell how a siphon works.
What did Hamlet mean by “be” when he said “To be or not to
be, that is the question”?

8. Summary of some unit of the text or of some article read.

State the plot of The House of Seven Gables in about one hundred
words.
Tell briefly the contents of the Declaration of Independence.

9. Analysis. (The word itself is seldom involved in the question.)

What characteristic of Silas Marner makes you understand why
Raveloe people were suspicious of him?
Mention several qualities of leadership.

10. Statement of relationship.

Why is a knowledge of botany helpful in studying agriculture?
Tell the relation of exercise to good health.

11. Illustration or examples (your own) of principles in science,
construction in language, etc.

Give two examples of the use of pure carbon in industrial work.
Illustrate the incorrect use of a relative pronoun with a
parenthetical phrase.

12. Classification. (Usually the converse of number eleven.)

Group the following words according to their part of speech and
name each group: red, boy, run, house, in, with, small, slowly, ball,
etc.
What do four of the five men named below have in common?
How do they differ historically?
Aristotle, Pericles, Homer, Cicero, Phidias.

13. Application of rules or principles in new situations.

Would you weigh more or less on the moon? On the sun? Why?
If you sat halfway between the middle and one end of a seesaw,
would a person sitting on the other end have to be heavier or lighter
than you in order to make the seesaw balance in the middle? Why?

14. Discussion.

Discuss the Monroe Doctrine.
Discuss early American literature.

15. Statement of aim—author's purpose in his selection or organization
of material.

What was the purpose of introducing this incident?
Why did he discuss this before that?

16. Criticism—as to the adequacy, correctness, or relevancy of a printed
statement or a classmate's answer to a question on the lesson.

Why were the Articles of Confederation doomed to failure?
What is wrong with the following menu? (Menu given below.)

17. Outline.

Outline the foreign policy of the federal government during the
Civil War.
Outline the steps required in computing the square root of a
five-figure number.

18. Reorganization of facts. (A good type of review question to give
training in organization.)

The student is asked for reports where facts from different
organizations are arranged on an entirely new basis.

19. Formulation of new questions—problems and questions raised.

What question came to your mind?
What else must be known in order to understand the matter
under consideration?

20. New methods of procedure.

Suggest a plan for proving the truth or falsity of some hypothesis.
How would you change the plot in order to produce a certain
different effect?




If this list and the accompanying illustrations are studied until 
the teacher has a real appreciation of their meaning, many hitherto 
neglected possibilities in essay testing will be revealed. It should be 
clear that some of these types, such as numbers 1, 2, 11, 12, and 
19, can be formulated in short-answer form, such as completion or 
simple-recall items. Attention to these types should also assist teachers 
in attaining a close relationship between instructional objectives and 
methods of evaluation, both for motivational purposes and for 
attaining validity. 

Weidemann (8) has attempted to refine still further the analysis 
of written essay questions into a series of definable types. He sug- 
gested eleven new essay-type questions proceeding from simple to 
complex, and for examination purposes only, as follows: 


1. what      2. list        7. explain
   who       3. outline     8. discuss
   when      4. describe    9. develop
   which     5. contrast   10. summarize
   where     6. compare    11. evaluate


For each of these a standard definition was proposed according to 
which teachers were to ask pupils to pattern their responses and by 
which the answers were to be graded. For example, a pupil's re- 
sponse to a "contrast" question should consist of a list of items of 
fact identifying dissimilarities between two concepts. In response 
to a "compare" question the response should consist of two lists of 
items of fact concerning two concepts, one of the lists identifying 
similarities, and the other dissimilarities. In response to an "explain" 
question, the response should be a list of items of fact, with each 
fact supported by a reason. In response to a “discuss” question, the 
response should consist of a multiple type of the "explain" essay in 
the form of three lists: (1) a list of affirmative arguments with 
reasons supporting them; (2) a list of negative arguments with 
reasons supporting each argument; and (3) a list of arguments re- 
futing each negative argument and giving reasons supporting each 
argument. If pupils are instructed to pattern their responses along
these lines, consistency should result in each pupil's understanding
of what he is required to do by a certain question and in the way
the teacher can grade pupils’ responses to that question. The first two
types listed by Weidemann are, of course, similar to recall tests that
may be arranged in short-answer form so as to increase objectivity
of scoring.

The care which should be exercised in the preparation of essay 
tests is well illustrated by the work of the College Entrance Ex- 
amination Board, particularly with its English examination (3). 
The fundamental purpose of the examination is to test the student's 
power to think through and to organize the materials contained 
in the books he has read, to read poetry and prose intelligently at 
sight, and to express his ideas in an effective way. A conscious effort 
is made by the examination builders to define as accurately as pos- 
sible the instructional objectives at which each question is aimed. 
In order to test the student's normal writing habits, all candidates 
are asked "to run the same race" by having them write on a single 
essay topic rather than permitting each student to select from among 
a collection of possible topics the one which supposedly provides the 
most "inspiration." Furthermore, each question is pretested before 
it is used by giving it to a group of students similar to those who 
will eventually take the English examination. This enables the 
examiners to find out how their questions are most likely to be 
misunderstood, assists in the selection of material, such as passages 
of prose or poetry, which will produce the best results, and enables 
an estimate of the length of time to be allotted to each question so 
as to avoid either too long or too short an examination. While
pretesting is probably not possible for the average classroom teacher 
constructing the first draft of an essay examination for use in a single 
classroom, it indicates the caution exercised by experts in constructing 
essay questions. 

A further control for essay questions which will serve to define 
more closely what is required of the pupils and to introduce greater 
consistency in their responses is to limit the amount of writing to 
a fixed number of words. In this way some indication is given to 
the pupil of how expansive his response should be, what degrees of 
generality and specificity it should contain. 

We may now summarize the suggestions calculated to improve 
the construction of essay questions as follows: 

1. Use essay questions to evaluate achievement of only those in- 
structional objectives not amenable to testing by the short-answer 
forms. 




2. Phrase the questions so as to require as precisely as possible 
the specific mental processes operating upon specific subject matter 
that are embodied in the instructional objective at which the ques- 
tions are aimed. 

3. Phrase the questions so as to give as many hints concerning the 
organization of the pupil’s answers as are not inconsistent with the 
instructional objective at which the questions are aimed. For example, 
pupils should not be asked merely to discuss a specific topic but
should also be given the bases upon which the discussion should
rest. Thus, the question “Discuss the Articles of Confederation”
would probably be improved from the standpoint of consistency of
pupils’ responses and of reliability of grading if it were elaborated
as follows: “Discuss the Articles of Confederation with respect to
their origin, their working out in practice, and their relationship to
the present federal Constitution.” Such hints concerning organization
of answers should not be given, of course, if part of the objective
at which the question is aimed is the pupil’s ability to distinguish
the relevant bases for his discussion. Such assistance to the
pupil really operates to increase the number of questions or
sub-questions in an essay examination and to reduce the length of the
answer to each. As such, the more specific the essay question be- 
comes, the more similar it becomes to short-answer forms of test 
items. Carried to an extreme, this technique would rob the essay 
question of its unique value in testing the pupil's ability to organize 
and express his answers. Each teacher must therefore maintain a 
fine balance between generality and specificity in essay questions so 
as to elicit as much organizational effort from the pupils as possible 
while at the same time giving all of them a common set of reference 
points so that their answers will be comparable with one another. 

4. Permit no choice among questions. Only by requiring all pupils
to answer all questions can all be made to take the same test and
thereby their achievement be rendered comparable. It is almost
impossible to equate optional questions for difficulty without an
elaborate pretesting program. Hence the teacher who permits pupils to
choose among optional questions can never know whether all of
them have taken a test of equal difficulty (6). Nor can pupils be
relied upon to choose essay questions from a series of options that
will enable them to exhibit their achievement in the best possible
light. Some pupils may think that by choosing more difficult
questions they will receive greater leniency and more credit from the
teacher, but others in the same classroom will not reason in this
way. Meyer (1) has found that pupils’ rankings of the quality of
their responses to essay questions agreed poorly with the rankings
given by the teachers grading these responses. If pupils had been
allowed to choose only the questions whose answers they thought
they knew best, they would not always have chosen the questions
which, upon subsequent grading of their papers, turned out to be
those they could best answer.

5. Balance the questions in difficulty so that the pupil can actually
write adequate answers to all of them within the allotted time if he
possesses the required achievement. Furthermore, the questions
should be arranged in order of difficulty for the same reasons as
were mentioned in connection with the short-answer test form.


Grading Essay Tests

Four methods are available for grading the answers to
essay questions: (1) the percentage-passing method, (2) the quality
scale, (3) the sorting or rating method, and (4) the checklist
point-score method. It will be worth while to discuss each of these methods
although, as we shall see, the last is probably the best and should be
used in most instances by classroom teachers.
1. The percentage-passing method involves giving each question 
a definite value and marking every answer to that question on the 


disadvantage is that the standard of "passing" varies so widely from 
teacher to teacher. Also, it permits such a wide range of scores— 
from zero to 100—that it gives a spurious notion of the fineness of 




if the papers were “normally distributed” would be 10 per cent,
20 per cent, 40 per cent, 20 per cent, 10 per cent, but no attempt
should be made to conform rigidly to these proportions.)

2. Reread the papers in each group and shift any that this second
reading indicates have been misplaced.

3. Give no numerical grades or separate evaluations of individual
questions; group each paper according to general total merit.

4. Assign the same letter grades to the papers in each group, A for
very superior, B for superior, etc.
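
As a rough illustration of the proportions mentioned above, the following Python sketch converts the 10, 20, 40, 20, 10 per cent guide into whole-number pile sizes for a class of a given size. The function name `expected_pile_sizes`, the largest-remainder rounding rule, and the pile letters C, D, and E are assumptions made for this example; the text names only A and B, and it warns that the proportions are a guide rather than a quota.

```python
# Expected percentages for the five piles, taken from the text above.
EXPECTED_PERCENTAGES = (10, 20, 40, 20, 10)
# The letters C, D, and E are an assumption; the text names only A and B.
PILE_LABELS = ("A", "B", "C", "D", "E")


def expected_pile_sizes(n_papers, percentages=EXPECTED_PERCENTAGES):
    """Return whole-number pile sizes that sum to n_papers."""
    if n_papers < 0:
        raise ValueError("n_papers must not be negative")
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    # Integer arithmetic avoids floating-point rounding surprises.
    hundredths = [p * n_papers for p in percentages]
    sizes = [h // 100 for h in hundredths]
    leftover = n_papers - sum(sizes)
    # Give each leftover paper to the pile with the largest remainder.
    by_remainder = sorted(
        range(len(sizes)), key=lambda i: hundredths[i] % 100, reverse=True
    )
    for i in by_remainder[:leftover]:
        sizes[i] += 1
    return sizes


if __name__ == "__main__":
    for label, size in zip(PILE_LABELS, expected_pile_sizes(30)):
        print(label, size)
```

The sizes are only a starting expectation against which a first sorting may be compared; the rereading step decides where each paper finally belongs.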

This rating procedure is, of course, preceded by the rater’s
preparation in working out, with the use of the textbook and other
materials on which the examination is based, a set of acceptable answers
to the essay questions. It may be objected that this method entails
a considerable sacrifice of refinement of grading in some cases
where more than five levels of general merit can be distinguished
in pupils’ answers. Considerable diagnostic value may also be
sacrificed by disregarding separate questions and considering merely the
general merit of the examination paper as a whole. Teachers may
have considerable difficulty in placing papers of which one answer
is distinctly superior while another answer is distinctly inferior.
In such a case, the “general merit” of the paper would be a
distinctly artificial concept, telling little of the pupil's specific
achievement. Of course, the rating method may be used with separate
questions, the pile of papers being sorted for one question at a time,
then recombined into one pile and resorted for the next question.
Here again, however, the sorting or rating of each answer according
to its general merit can easily become a vaguely defined procedure,
subjective and without any helpful quantitative basis. It is
to remedy this defect that the fourth procedure has been designed.

4. The check-list-point-score method involves analyzing the ideal 
response to the essay question into a series of features or points, each 
specifically defined. The pupil’s answer is then judged with respect 
to each feature and a point is awarded if the feature is present in the 
response. This method is well illustrated by the procedure of the 
readers of the English examination of the College Entrance Examina- 
tion Board (3). Each of the readers is rigorously trained and re- 
tained during the week of examination grading in the following pro- 
cedure: 




Each question is graded on a series of points suggested by the terms 
and the purpose of the question itself, and determined by a careful 
analysis of the answers written by the students. Each reader will deal, 
at most, with only a third of any answer book, and he will grade it 
according to standards which he himself has helped to define and estab- 
lish. He is trained to analyze each answer to discover the presence or 
absence of certain qualities, or elements. The new analytical method of 
grading has proved especially valuable in dealing with the essay ques- 
tion. As all English teachers know, marking themes is a ticklish business. 
Many elements have to be considered: the student's organization of his 
material; technical points of composition such as spelling and punctua- 
tion; sentence and paragraph structure; the right or wrong use of words, 
and so forth. Our present system, by giving definite and predetermined 
values to each of these elements, makes it possible for all the readers to 
use approximately the same yardstick, and hence has contributed im- 
mensely to the increase in reliability of reading. Last year, for example, 
the maximum grade for the essay question was eleven points. Four were 
awarded for accuracy in writing, or technique of composition. Three 
were given for organization in paragraph structure, and four for varying
knowledge and skill in the use of books required by the topic. 


A further illustration of this method is the way in which the 
Board graded the following item on the interpretation of sight 
passages: 


A metaphor is a transfer of meaning, one thing or act being named 
or implied when another is meant. It is the commonest and most serv- 
iceable figure in language. . . . There is at least one metaphor in each 
of the following passages. Indicate the metaphors in each passage and 
translate them in such a way as to show your understanding of the 
author’s use of them. Allow about twenty-five minutes for this question.
The model below is intended to suggest the kind of an answer expected.
[Model omitted.] 

1. “Poverty is the banana skin on the doorstep of romance,” P. G. Wode- 
house. 


The pupil’s answers to questions in this series were to be marked on
a maximum of five points:

Adequate comprehension of the passage as a whole, 1 point.
Explicit indication of the principal metaphors (banana skin and
doorstep), 1 point.
Adequate translation of these terms, 1 point.
More than a merely adequate understanding of the passage,
i.e., recognition of Wodehouse's humorous purpose, 1 point.
Composition—To all answers not incoherent or marred by
serious grammatical errors, 1 point.

For example, the following response would receive credit for
only the fifth point: “In this sentence Wodehouse means the banana 
skin to be an aid rather than a hindrance to romance. By this Wode- 
house shows that a person does not stop to think about falling in 
love merely because of poverty.” The pupil failed to comprehend the 
meaning of the passage, indicated only one of the metaphors, mis- 
translated that one, and loses the fourth point for not having more 
than a merely adequate comprehension. He receives the fifth point 
because his answer is reasonably grammatical and has no mis- 
spellings, and the sentences are properly punctuated.

The following answer receives credit for all points except the 
second: “Wodehouse is humorously saying that poverty is the danger 
that romance faces at its beginning.” The student gained the first 
point because he clearly understood the passage, lost the second for 
failure to indicate the metaphors explicitly, and won the third for 
translating both metaphors, the fourth for recognizing the humorous 
purpose, and the fifth for composition. 
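
The check-list procedure just illustrated can be sketched in a few lines of Python. The five features and the two sets of judgments below follow the Wodehouse example in the text; the names `ChecklistPoint`, `WODEHOUSE_CHECKLIST`, and `score_answer` are hypothetical, and deciding whether a feature is present remains the reader's judgment, which the code merely records and totals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistPoint:
    """One feature of the ideal response and the credit it carries."""
    name: str
    points: int = 1


# The five features listed for the Wodehouse item, one point each.
WODEHOUSE_CHECKLIST = (
    ChecklistPoint("comprehension of the passage as a whole"),
    ChecklistPoint("explicit indication of the principal metaphors"),
    ChecklistPoint("adequate translation of the metaphors"),
    ChecklistPoint("recognition of the humorous purpose"),
    ChecklistPoint("composition"),
)


def score_answer(checklist, judgments):
    """Total the points for every feature the reader judged present.

    judgments maps each feature name to True (present) or False (absent).
    """
    missing = [p.name for p in checklist if p.name not in judgments]
    if missing:
        raise ValueError(f"no judgment recorded for: {missing}")
    return sum(p.points for p in checklist if judgments[p.name])


# Reader's judgments for the first sample answer: only composition earns credit.
first_answer = {p.name: False for p in WODEHOUSE_CHECKLIST}
first_answer["composition"] = True

# Reader's judgments for the second sample answer: all but the second point.
second_answer = {p.name: True for p in WODEHOUSE_CHECKLIST}
second_answer["explicit indication of the principal metaphors"] = False

if __name__ == "__main__":
    print(score_answer(WODEHOUSE_CHECKLIST, first_answer))
    print(score_answer(WODEHOUSE_CHECKLIST, second_answer))
```

Requiring a recorded judgment for every feature is a design choice of this sketch; it guards against a reader silently skipping a point on the list.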

The reliability of grades obtained with this method of scoring 
has been shown by Stalnaker (7) to be very high, ranging above
.90 for all College Entrance Board examinations except English,
whose reliability of reading was .84. It must be emphasized that such 
excellent results can be obtained only when the persons who grade 
the essay test papers have undergone a careful training or self-train- 
ing program. 
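
A teacher who wishes to check the consistency of his own grading may compare two independent readings of the same set of papers. The Python sketch below assumes that reliability of reading is expressed as the product-moment correlation between two readers' scores; the text does not state how the coefficients reported above were computed, so this is only one common way of obtaining such a figure. The function name `reader_agreement` and the scores are hypothetical.

```python
import math


def reader_agreement(first_reading, second_reading):
    """Product-moment correlation between two readers' scores on the same papers."""
    if len(first_reading) != len(second_reading):
        raise ValueError("both readers must grade the same papers")
    n = len(first_reading)
    if n < 2:
        raise ValueError("at least two papers are needed")
    mean_1 = sum(first_reading) / n
    mean_2 = sum(second_reading) / n
    # Sums of products and squares of deviations from each reader's mean.
    covariation = sum(
        (a - mean_1) * (b - mean_2) for a, b in zip(first_reading, second_reading)
    )
    spread_1 = sum((a - mean_1) ** 2 for a in first_reading)
    spread_2 = sum((b - mean_2) ** 2 for b in second_reading)
    if spread_1 == 0 or spread_2 == 0:
        raise ValueError("a reader gave every paper the same score")
    return covariation / math.sqrt(spread_1 * spread_2)


if __name__ == "__main__":
    # Hypothetical point scores given by two readers to six papers.
    reader_a = [4, 1, 3, 5, 2, 4]
    reader_b = [4, 2, 3, 5, 1, 3]
    print(round(reader_agreement(reader_a, reader_b), 2))
```

A coefficient near 1 indicates that the two readers rank the papers in much the same way; a low value suggests that the list of points needs sharper definition or that the readers need further training.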

It must be remembered that in grading according to this system
the teacher must work out a rigorous analysis of the things to be 
desired in the pupil's response to a question in the definite and ex- 
plicit forms indicated in the above example. This means that he 
should write out an answer to the questions so as to contain all the 
ideas desired from pupils. Perhaps another list of ideas which 
should not be regarded as acceptable for credit should also be made. 

Further Aids in Grading Essay Tests.—The following procedures 
and devices have also been suggested as means of increasing the 
reliability, or objectivity, of grading essay examination papers: 




1. Grade papers anonymously so that the grader does not know 
whose paper is being graded. Personal factors such as teacher-student 
relationships can thus be largely eliminated and the paper graded 
solely on its merit. A teacher's general opinion of a pupil, his 
prejudice for or against the pupil, cannot then affect the grade 
assigned to the paper. The desired anonymity may be secured by 
having the pupils write their names either on the back or at the end 
of the examination paper, or by assigning randomly selected num- 
bers to the pupils in such a way that they are unknown to the 
teacher, and later having the pupils put their names opposite the 
list of numbers. 

2. Grade only one question at a time. In this way one pupil's 
answer to a given question is more easily compared with all the 
other answers to the same question. Also, this procedure requires 
the teacher to keep only one list of points in mind at a time. Thus he
does not have to waste time in continually refreshing his memory 
concerning the points required in successive questions. 

3. Use double grading wherever possible, especially with im- 
portant examinations that determine promotion, graduation, and 
so on. Double grading is simply the process of having more than 
one person grade a paper, The graders should, of course, have the 
same standards and consider the same points in grading. Such 
agreement between graders, and equal competency of graders, are 
difficult to secure without the expenditure of additional time and 
effort by relatively highly competent personnel, such as teachers; 
This extra labor is the chief drawback of double grading, but it is 
more than justified in view of the demonstrated unreliability of 
much essay grading. Whenever two graders disagree sharply with 
each other, the papers should preferably be read by a third grader 
rather than arbitrated by the two original graders, for in the latter 
situation the grade will frequently be more weighted by the 
opinion of the grader with the more aggressive and dominant 
personality. 
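
As a rough sketch of how double grading might be recorded, the following Python function averages two graders' scores and asks for a third reading when they disagree sharply. The threshold `max_gap` and the rule for combining the third reading with the nearer original score are assumptions made for illustration; the text recommends a third reader but does not prescribe how the final grade is computed.

```python
def resolve_double_grade(first, second, third=None, max_gap=10):
    """Combine two graders' scores, requiring a third reading on sharp disagreement.

    max_gap is a hypothetical threshold for "sharp" disagreement.
    """
    if abs(first - second) <= max_gap:
        return (first + second) / 2
    if third is None:
        raise ValueError("graders disagree sharply; a third reading is needed")
    # Assumed rule: average the third reading with the closer original score.
    closer = min((first, second), key=lambda score: abs(score - third))
    return (closer + third) / 2

print(resolve_double_grade(80, 84))            # graders agree closely
print(resolve_double_grade(60, 90, third=85))  # third reader settles it
```
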

4. Make special provision for the consideration of sentence struc- 
ture, paragraphing, writing ability, spelling, and so forth. These 
factors should not be permitted to increase or decrease the student's 
score on an essay question which is primarily concerned with other 
instructional objectives. Since the English department is primarily 
concerned with these aspects of the pupil's writing, they should be 
disregarded in evaluating achievement in the social studies, sciences, 
and so on. It will, of course, require a conscious effort on the part 
of the teacher to prevent himself from being influenced adversely by 
poor spelling and grammar, or, on the other hand, favorably 
prejudiced by neat handwriting, extensive vocabulary, and fine prose. 
While all these factors in pupils’ answers to essay questions, with 
the probable exception of quality of handwriting, will in the long 
run be found to be correlated with pupils’ achievement in all subjects, 
they are, strictly speaking, irrelevant to the other kinds of 
achievement at which the essay test is aimed. 


SUMMARY 


Essay tests are best used to evaluate achievements which cannot 
be better evaluated by means of short-answer tests. These achieve- 
ments include writing essays, and assimilating, organizing, and 
evaluating large amounts of subject matter. The construction of valid 
essay tests requires an appreciation of the large variety of forms in 
which essay questions can be put. Lists of types of essay questions 
are presented with illustrations. Methods of grading essay tests are 
the percentage-passing, the quality scale, the sorting or rating, and 
the check-list-point-score methods. The last one is probably the best 
in that it has been found to yield the highest reliability. It depends 
for its best results on the rigorous training or self-training of the 
test graders. Further aids to essay test grading are grading anonymously, 
considering single questions at a time, double grading, and 
giving separate consideration to the mechanical aspects of a pupil's 
answers. 


QUESTIONS 


1. In what sense does the improvement of essay questions make them 
more similar to short-answer test items? What properties remain 
unique with essay tests? 

2. To what extent can the excellence of an essay question be judged 
from the comparability, or degree of presence of common elements, 
of the answers it elicits? 

3. Is the use of a rigorously defined set of scoring standards incom- 
patible with the aim of essay tests to reveal originality and self-ex- 
pressiveness? 


4. One type of essay question presents a scrambled outline of a discussion 
of a given topic and instructs the pupil to write a well-organized 
discourse on the topic using the scrambled outline as a guide. Evaluate 
this device from the standpoint of the mental processes it may elicit 
and the subject matter for which it is particularly fitted. 

5. It has been argued that keeping answers to essay tests anonymous 
during grading prevents the teacher from evaluating answers in the 
light of individual backgrounds. What do you think of this argument? 


REFERENCES 


1. Meyer, G., “The choice of questions in essay examinations,” Journal 
of Educational Psychology, 30 : 161-171 (1939). 

2. Monroe, W. S., and Carter, R. E., “The use of different types of 
thought questions in secondary schools and their relative difficulty 
for students,” Bulletin No. 14, Bureau of Educational Research, College 
of Education, University of Illinois, 1923. 

3. Noyes, E. S., “Recent trends of the comprehensive examination in 
English,” Educational Record, Supplement No. 13, 21 : 107-119 
(1940). 

4. Odell, C. W., “The use of scales for rating pupils’ answers to thought 
questions,” University of Illinois Bulletin, Bureau of Educational Research, 
Bulletin No. 46, 1929. 

5. Sims, V. M., “The objectivity, reliability, and validity of an essay 
examination graded by rating,” Journal of Educational Research, 
24 : 216-223 (1931). 

6. Stalnaker, J. M., “A study of optional questions on examinations,” 
School and Society, 49 : 829-832 (1936). 

7. Stalnaker, J. M., “Essay examinations reliably read,” School and 
Society, 46 : 671-672 (1937). 

8. Weidemann, C. C., “Review of essay examination studies,” Journal 
of Higher Education, 12 : 41-44 (1941). 




CHAPTER XIII 


Evaluating Physical Aspects of 
Pupils 




WE HAVE ALREADY DISCUSSED, IN CHAPTER III, THE NATURE OF THE 
physical aspects of pupils with which teachers should be concerned, 
the reasons for their concern with these aspects, and the relationship 
of the teacher to other school health agencies. In this chapter we shall 
present the techniques whereby the physical aspects of pupils may 
be evaluated by teachers. 

It will be recalled that the role of the teacher in providing this 
evaluation and health service must always be as a supplement, in 
whatever degree necessary, to the health agencies the school may 
provide, such as school nurses and physicians. Teachers cannot be 
expected to possess, except in very slight degree, the knowledge and 
technical information required for detailed diagnoses of physical 
defects and diseases. Rather, they must serve merely in the capacity 
of enlightened laymen exercising whatever powers of observation 
they can acquire both by the proper mental set toward physical 
aspects of pupils and by such suggestions, directions, and informa- 
tion as can readily be acquired in the teacher-training curriculum. 
Teachers should be willing to accept the encouragement of the medi- 
cal profession that they can do effective work in the detection of 
physical disorders incompatible with normal growth, development, 
and educational progress, or dangerous to other members of society. 
If every pupil could be examined by a physician every day of the 
school year, it would be largely unnecessary for the teacher to be 
concerned with this type of evaluation. Since this is manifestly 
impossible even in the largest city school systems, the teacher must 
step into the breach and serve as the best possible substitute for such 
service. It has been demonstrated in many schools that only a short 
time is required for teachers to become sufficiently acquainted with 
the signs and symptoms of physical disorders to be able to evaluate 
their pupils in this way and call attention to such conditions as need 
examination by a physician. It is the purpose of this chapter to 
provide teachers with such information and suggestions as will 
enable them to carry out this work most efficiently. 

Of the physical aspects of pupils which were enumerated in 
Chapter III, some may be evaluated every day in a somewhat casual 
fashion which does not require any special equipment or prepara- 
tion. Among these are the pupil’s carriage or posture, breath, neck, 
chest, back, legs, feet, and clothing. The presence of communicable 
diseases usually can be detected by informal observation of the 
pupil's appearance and demeanor, with reference to such lists of 
symptoms as will be presented below. Other physical aspects are 
best evaluated by teachers on special occasions with special equip- 
ment which is usually readily available. Pupils’ growth, height and 
weight, eyes, ears, and teeth all require special evaluative procedures. 

Let us now take up each of the physical aspects of pupils discussed 
in Chapter III with reference to the techniques and considerations 
involved in evaluating them. 


VARIOUS PHYSICAL ASPECTS AND MEANS OF EVALUATION 


1. Growth.—Determining the pupil's weight and height are the 
two most important procedures involved in evaluating his growth. 
The chief value of weighing and measuring pupils lies in the educational 
experience for them, not as an index of pupil health. Height-weight-age 
tables can easily be misinterpreted. Equally healthy children 
of the same age may vary considerably in height or weight, or 
both together. Differences in race, nationality, and constitutional 
body type are not taken into account by such tables. Even more refined 
measurements of the size and weight of the body, based on 
elaborate anthropometric procedures, furnish no safe way of de- 
termining a pupil's physical fitness. 

Thus, Jenss and Souther (3) correlated four indices of body build 
(the Baldwin-Wood Weight-Height-Age Tables, the ACH Index 
of Nutritional Status, the Nutritional Status Indices, and the Pryor 
Width-Weight Tables) against clinical judgments by a physician. 
The subjects of their study were 713 seven-year-old children of both 


260 How to Evaluate 


sexes. None of the four indices proved to be an efficient method of 
identifying children who, according to the physician’s criteria of 
nutritional status, need of medical and dental care, etc., are likely to 
be physically unfit. “The indices are neither selective nor sensitive, 
as they fail to identify a considerable number of boys and girls whom 
a given criterion selects as likely to be in need of medical care or 
nutritional advice and assistance, and, in addition, they often identify 
children who were not selected by the criterion” (3 : 94). The interest 
of pupils in their own growth is, however, well satisfied by means 
of periodic measurements with a scale. A good procedure for weigh- 
ing pupils is as follows: 

a. Plan for regular monthly weighing, if possible, or less frequently 
at regular intervals. 

b. Always see that the scales are adjusted and balanced. They can 
be tested by placing the sliding weight at zero and observing 
whether the scales balance. Or the teacher may test the scales by 
weighing himself on a scale of known accuracy and then seeing 
whether he weighs the same on the school scale. 

c. Have the same set of conditions each time: the same scales, if 
possible; the same weight of clothing; shoes off, sweaters and wraps 
removed, empty pockets; the same day, the same hour. 

d. Have the pupil stand in the middle of the foot space on the 
scale. 

e. Record the pupil’s weight to the nearest half pound. 

f. Keep a permanent record of each weighing. 

g. Encourage older pupils to keep individual records and, per- 
haps, to make individual and group graphs of growth rate. 
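
The record keeping in steps (e) through (g) can be illustrated with a short Python sketch. It is offered only as an example: the function names and the sample monthly weights are invented, and rounding an exact quarter pound upward is an assumed convention that the text does not specify.

```python
import math

def to_nearest_half_pound(weight_lb):
    """Round a scale reading to the nearest half pound (exact ties round up)."""
    return math.floor(weight_lb * 2 + 0.5) / 2

def monthly_gains(weights_lb):
    """Return the gain between each pair of successive recorded weighings."""
    recorded = [to_nearest_half_pound(w) for w in weights_lb]
    return [later - earlier for earlier, later in zip(recorded, recorded[1:])]

# Hypothetical readings for one pupil over three monthly weighings.
readings = [70.3, 71.2, 72.8]
print([to_nearest_half_pound(w) for w in readings])
print(monthly_gains(readings))
```

A list of gains such as this is the raw material for the individual and group growth graphs suggested in step (g).
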

The most commonly appreciated other factor in the growth of a 
pupil is his height. The procedure in measuring height should be as 
follows: 

a. Measure height three times a year—early fall, mid-winter, and 
late spring. 

b. Use the same measuring apparatus each time. A tape measure 
or yardstick can be accurately and permanently attached to a door 
jamb or wall. A yardstick can be nailed with the zero mark 36 
inches from the floor, the 36 inches being allowed for in the measurements. 
A still better device is the metal rod attached to many weighing 
scales made for clinic or school use. 


c. Be sure the pupil being measured stands without shoes, with 
heels flat on the floor and against the wall or rod, at full height, with 
hips and head touching the measuring rod. 

d. Take the measurement accurately. Use an accurate right angle 
—a chalk box, picture frame, or other device—with the long, flat 
side against the measuring tape; or use the sliding metal rod on 
the weighing scale if available. 
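
Where a yardstick is mounted with its zero mark 36 inches above the floor, as suggested in step (b), the allowance for the offset is a simple addition. The sketch below shows it in Python; the function names and the sample reading are hypothetical.

```python
YARDSTICK_ZERO_HEIGHT_IN = 36  # zero mark nailed 36 inches above the floor

def height_from_yardstick(reading_in, zero_height_in=YARDSTICK_ZERO_HEIGHT_IN):
    """Convert a reading on the mounted yardstick to height above the floor."""
    if not 0 <= reading_in <= 36:
        raise ValueError("reading must fall on the yardstick (0 to 36 inches)")
    return zero_height_in + reading_in

def feet_and_inches(height_in):
    """Express a height in inches as (feet, remaining inches)."""
    feet, inches = divmod(height_in, 12)
    return int(feet), inches

height = height_from_yardstick(17.5)   # hypothetical reading of 17.5 inches
print(height, feet_and_inches(height))
```
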

The terms “underweight” and “malnourished” should not be 
considered synonymous. The former is relative to a height-weight- 
age table, while the latter is relative to the pupil's own most healthy 
condition. Nevertheless, the measurement of height and weight with 
reference to tables can be valuable in revealing whether a pupil has 
gained from year to year satisfactorily and in interesting him in 
his growth progress. The most valid guide in evaluating growth, 
however, must always remain the personal experience, training, and 
insight of the observer. Teachers seldom differ among one another 
greatly in their opinions concerning the healthy or unhealthy 
appearance of a pupil. It is this appearance, rather than size, to 
which we must attend when we speak of nutrition. Underheight and 
underweight are, of course, important and should be kept in mind, 
but more important are the pupil’s appearance of vigor or lethargy, 
alertness or dullness, the clearness of his eyes, the glow or pallor of 
his skin, and the lack or overabundance of the layers of fatty tissues 
beneath the skin. 

2. Posture or Carriage.— The posture or carriage of a pupil is 
easily evaluated by comparing his way of holding his body, of carry- 
ing himself—standing, walking, or sitting—with the standards of 
good posture. These standards are described and illustrated in Fig. 4. 

It is essential in evaluating posture to create such a situation that 
the pupils will not self-consciously assume a pose rather than their 
usual posture. A pose may either increase a slightly faulty position 
into a markedly abnormal one or, if a pupil is familiar with correct 
posture standards, may artificially result in a better posture than his 
usual one. Thus, misleading results will usually be obtained if 
pupils are lined up in groups for the examination of posture or 
shadow graphs. It is better to make the observation of posture un- 
obtrusively, during the course of evaluating other aspects or during 
the usual classroom routine. 


POSTURE STANDARDS 
Intermediate-Type Girls and Boys 

EXCELLENT POSTURE 
1. Head up, chin in (head balanced above shoulders, hips, and ankles) 
2. Chest up (breast bone the part of body farthest forward) 
3. Lower abdomen in, and flat 
4. Back curves within normal limits 

GOOD POSTURE 
1. Head slightly forward 
2. Chest slightly lowered 
3. Lower abdomen in (but not flat) 
4. Back curves slightly increased 

POOR POSTURE 
1. Head forward 
2. Chest flat 
3. Abdomen relaxed (part of body farthest forward) 
4. Back curves exaggerated 

BAD POSTURE 
1. Head markedly forward 
2. Chest depressed (sunken) 
3. Abdomen completely relaxed and protuberant 
4. Back curves extremely exaggerated 

Fig. 4.—Posture standards for girls and boys. (After 3 : 10-11.) 


Very poor posture can be corrected and is worth correcting, accord- 
ing to the findings of Klein and Thomas (6). These authors found 
that training pupils according to the exercises contained in Posture 
Exercises, a Handbook for Schools and for Teachers of Physical 
Education (5) strikingly reduced the prevalence of poor posture in 
school children. Furthermore, this improvement was associated 
with improvement in health, efficiency, and school work as revealed 
in attendance, deportment, and scholarship. 

3. The Skin and Hair.—General cleanliness can usually be evalu- 
ated by simple observation. It can be fostered by emphasizing it in 
the classroom, drawing the pupils’ attention to its desirability, and 
discussing with them the harmful aesthetic, social, and physical 
effects of uncleanliness. In general, uncleanliness facilitates infection 
with parasites of various kinds and helps the spread of contagious 
diseases. 

Among the various skin diseases for which teachers should be on 
the alert are ringworm, impetigo, scabies, acne, and eczema. Ring- 
worm shows itself by a slightly raised, reddish scaly spot which 
enlarges, the center clearing and the edges advancing as a superficial 
scaling in the form of a circular or oval “ring.” It may occur in any 
part of the body, on the face, neck, or arms, and also on the scalp 
or between the toes. Treatment is usually effective for ringworm on 
the body, but very difficult when on the head or feet. 

Impetigo appears as a crust of fingernail size, irregular in shape 
and honey-yellow in color, usually on the face or hands and often 
behind the ears. New spots can result from carrying the material 
from one lesion to other parts of the skin. Since the disease is highly 
contagious, exclusion of the pupil from school should be enforced 
unless the lesions are effectively covered with a bandage. 

Scabies evidences itself by red points and lines on the skin which 
indicate the punctures and pathways of the itch mite that causes it. 
The itch mite is a tiny parasite which bores into the skin and causes an 
intense itching; scratching is the most obvious symptom. The mites 
themselves are too small to be seen easily. Because they are easily 
transmitted, especially within families, and can easily cause secondary 
infection, cases of scabies should be excluded from school. Reinfec- 
tion readily occurs unless the underlying colonies of the itch mite in 
clothes or bedding or other persons are eliminated. For this, the 
whole family must be treated and its clothes and linen sterilized by 
heat. 

Acne is most frequent during adolescence, being characterized by 
pimples or “furuncles” on the skin of the face. The discharge from 
one pimple produces new lesions when spread to other portions of 
the skin so that the disease easily becomes chronic and lasts for 
years, leaving unsightly scars. The mental hygiene aspect of acne 
is especially important because its unsightliness is apt to make 
adolescents feel miserable. Obviously, teachers should be on the lookout 
for incipient cases of acne, caution pupils against fingering the 
lesions, and advise prompt medical attention. Soap and water are 
an excellent preventive, and the frequency with which acne clears 
up when boys begin to shave suggests that cleanliness can also assist 
in the cure. Attention to diet, especially the limitation of sweets, 
cocoa, and chocolate, is also required in many cases. 

Eczema takes the form of patches of roughness that crack, scale, 
and at times exude moisture. Frequently chronic, it is often due 
to allergy, the sensitivity of the individual either to certain foods or 
to something in the clothes or environment that comes in direct 
contact with the skin. It is not contagious, but its discomfort and 
unsightliness should lead pupils to seek prompt medical attention. 

Apart from these skin infections, the teacher should be on the alert 
for contagious diseases which manifest themselves by the appearance 
of the skin, such as measles, German measles, scarlet fever, chickenpox, 
and smallpox. In general, they are characterized by a rash of 
red spots whose size and distribution on the body vary according 
to the disease. The scarlet fever rash is like a generalized blush, so 
that whole portions of the body appear red. The German measles 
rash is composed of separate little red spots, closely set and merging 
together in places, but presenting the isolated lesions somewhere on 
the body. The measles rash has larger spots than German measles, 
and they tend to form blotches of smaller or larger size. In these 
three diseases the rash spreads from the neck downward over the 
chest, abdomen, and extremities; the scarlet fever rash generally 
avoids the face, but the other two do not. 

Chickenpox differs greatly from these three diseases, occurring 
typically as tiny water blisters known as vesicles, that rupture easily 
and are surrounded by a narrow zone of redness. Instead of spreading 
downward in the wavelike fashion of the other diseases, chickenpox 
lesions come out in groups in the spaces between the older 
lesions. In this disease, several stages of lesions are present at one 
time, and they are more abundant on the body than on the face or 
limbs. Smallpox is different from chickenpox in that the vesicles are 
all at the same stage, first little red bumps which gradually change 
into lesions with pus. It may be further differentiated from chickenpox 
in that it affects the arms, hands, legs, feet, and face more than 
the trunk. 

The hair of pupils should be watched mainly for the presence of 
lice, or pediculi, and of nits, or the eggs of the lice. Any prolonged 
itching and scratching of the scalp should lead the teacher to suspect 
these conditions. The live lice are too small to be found easily, 
but the nits are easily seen as oval gray bodies clinging to the hair, 
especially behind the ears and on the back of the head. Like scabies, 
the condition is easily transmitted, especially within families, and 
generally must be treated with respect to the entire family at a time. 
Like scabies, head lice are more common in pupils from lower 
economic strata but are not confined to them. They should be eradi- 
cated not only as a sign of uncleanliness and a cause of discomfort, 
but also because their presence leads to infections of other kinds. 
Live nits can be killed by the application of appropriate medication 
to the scalp and removed by softening with vinegar and combing. 
For boys it is simple to cut the hair short. 

4. The Eyes.—Several methods can readily be used by teachers in 
evaluating pupils’ eyesight. Among these are the Snellen Chart, a 
check list of observable behaviors, the Eames Eye Test, and the 
Betts Telebinocular. The Snellen Chart contains letters¹ printed in 
such a way that each portion of the letter is as wide as the length of 
the tangent of an angle of one minute at the distance at which the 
letter is supposed to be read. This width is chosen because the 
smallest angle of visual acuity of a normal eye is approximately one 
minute. (The trigonometry involved here is not essential to the 
effective use of the chart.) The chart should be kept out of sight and 

¹ For very young pupils who have not yet learned the alphabet, special Snellen 
Charts with only E’s pointing up, down, or to either side are available. Snellen Charts 
are available from the American Medical Association, 535 N. Dearborn Street, Chicago, 
Illinois, and from the National Society for the Prevention of Blindness, 50 West 50th 
Street, New York City. 


in a clean place so as to prevent the pupil from becoming familiar 
with the letters and to keep it in good condition. When in use, it 
should be placed so that the pupil can stand at a distance of exactly 
20 feet from it; it should be in a good light (about 10 foot-candles) 
that is evenly distributed, and the general illumination in the room 
should not be less than one-fifth of the chart illumination (not less 
than 2 foot-candles). The light should not shine into the pupil's eyes 
but should come from the side. The height of the chart should be 
such that the child's eye will be level with the 20-foot line. The 
20-foot distance from the chart should be marked off, and also the 
15-, 10-, and 5-foot distances; these shorter distances are used only 
for the purpose of teaching small children to use the chart and for 
those who cannot see the 20-foot line at 20 feet. The pupil should 
be instructed how to cover one eye with a cover card; both eyes 
should remain open during the test. The cover card should lie 
obliquely across the nose, being held at the edge. If the pupil wears 
glasses, he should be tested first with them and then without. 
Squinting and straining should not be permitted. 
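
For readers curious about the trigonometry behind the chart's construction, the rule that each portion of a letter spans the tangent of one minute of arc at the distance for which its line is designed can be worked out as follows. This is a minimal Python sketch; the function name is invented and the line designations listed are merely examples.

```python
import math

ONE_MINUTE_OF_ARC = math.radians(1 / 60)  # one sixtieth of a degree, in radians

def stroke_width_inches(design_distance_ft):
    """Width of each portion of a letter meant to be read at the given distance."""
    return design_distance_ft * 12 * math.tan(ONE_MINUTE_OF_ARC)

# Example line designations, in feet.
for distance_ft in (200, 40, 30, 20, 15):
    print(f"{distance_ft:>3}-foot line: {stroke_width_inches(distance_ft):.4f} in.")
```

The width grows in direct proportion to the design distance, so the portions of the 200-foot letters are ten times as wide as those of the 20-foot letters.
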

A standardized routine should be followed to avoid confusion: 
test the right eye first, then the left eye, and then both together. 
Beginning at the top of the chart with the larger letters, the pupil 
should read all the letters he can. If he reads all of them, including 
those for the distance at which he is placed, he is not nearsighted, 
but he may be farsighted or moderately astigmatic, or have muscular 
strain when using both eyes. If he reads smaller letters than are 
expected of the normal eye (as in reading the 15-foot line at 20 feet) 
his vision is keener than that of the average person. Inability to read 
the line expected of the normal eye, but only the 30- or 40-foot line 
at 20 feet may indicate either nearsightedness, astigmatism, defects 
in the refractive media of the eye, corneal scars, or even fairly high 
degrees of farsightedness. 

The visual acuity for each eye should be recorded as a fraction 
whose numerator is the pupil’s distance from the chart, and whose 
denominator is the distance at which normal eyes can read the 
line. For example, 20/20 means normal visual acuity, i.e., the pupil 
can read at 20 feet the letters which normal eyes can read at 20 
feet; 20/200 means such defectiveness that the pupil can read only 
at 20 feet what normal eyes can read at 200 feet. 20/15 means being 
able to read at 20 feet what the normal eye can read only at 15 
feet. These fractions should not be considered as percentages of 
normal vision. 
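
The recording rule can be put in a brief Python sketch. The function names and the three descriptive labels are illustrative assumptions, not a clinical classification; the numerator is the pupil's distance from the chart and the denominator is the distance at which normal eyes read the smallest line the pupil read.

```python
def snellen_record(test_distance_ft, line_distance_ft):
    """Return the acuity record for one eye, e.g. '20/30'."""
    return f"{test_distance_ft}/{line_distance_ft}"

def interpret(test_distance_ft, line_distance_ft):
    """Compare the line read with the line expected of the normal eye."""
    if line_distance_ft < test_distance_ft:
        return "keener than normal"
    if line_distance_ft == test_distance_ft:
        return "normal"
    return "below normal"

for line_ft in (15, 20, 200):
    print(snellen_record(20, line_ft), "-", interpret(20, line_ft))
```

The record is kept as a pair of distances and is never reduced to a decimal, in keeping with the caution that these fractions are not percentages of normal vision.
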

Measurements with the Snellen Chart should be correlated with 
observations of pupils’ visual behavior in the classroom. The follow- 
ing list (2) of observable behaviors which are symptomatic of 
disturbances of vision has been suggested: 


1. Attempts to brush away blurs. 
2. Blinks continually when reading. 
3. Cries frequently. 
4. Has frequent fits of temper. 
5. Holds the book far away from face when reading. 
6. Holds his face close to the page when reading. 
7. Holds his body tense when looking at distant objects. 
8. Is inattentive during reading lessons. 
9. Is inattentive during wall-chart, map, or blackboard lessons. 
10. Is inattentive during class discussion of field trips or visits to 
museums. 
11. Is irritable over work. 
12. Reads but brief periods without stopping. 
13. Reads when he should be at play. 
14. Rubs his eyes frequently. 
15. Screws up his face when reading. 
16. Screws up his face when looking at distant objects. 
17. Shuts or covers one eye when reading. 
18. Thrusts his head forward to see distant objects. 
19. Tilts his head to one side when reading. 
20. Has poor alignment in penmanship. 
21. Has reversal tendencies in reading. 
22. Tends to look crosseyed when reading. 
23. When reading, tends to make frequent changes in distance at 
which he holds his book. 
24. When reading, tends to lose the place on the page. 
25. Confusion in reading and spelling: o’s and a’s; e’s and c’s; n’s 
and m’s; h’s, n’s, and r’s; s’s and z’s. 
26. Apparently guesses from a quick recognition of parts of the 
word in easy reading material. 


The Eames Eye Test² consists of seven parts, each of which is 
aimed at the detection of eye defects needing professional care. 
(a) The visual acuity test is the conventional Snellen-type test. 


² Obtainable from the World Book Company, Yonkers, New York. 


(b) The lens test is similar to the visual acuity test except that the 
pupil attempts to read the letters while looking through a lens designed 
to detect farsightedness. (c) The astigmatic chart test is the 
conventional radiating-line test for astigmatism. (d) The coordination 
test requires the pupil to look at a card depicting a chicken and 
a box through a hand stereoscope. (e) The fusion test uses a card 
on which moon and stars appear, to detect defects of binocular 
vision. Tests (f) and (g) are for fusion of type and eye dominance, 
aspects of pupil vision that may be of interest to school psychologists 
and remedial teachers. 

The Eames test represents a valuable compromise between the 
overly simple Snellen Chart and the more elaborate Betts Telebinocular 
described below. Its relatively low cost and ease of administration 
and interpretation increase its value for classroom 
vision testing by teachers in order to discover eye troubles needing 
professional attention without regard to specific diagnosis. Evidence 
that the Eames Eye Test possesses a high degree of reliability and 
validity, in terms of agreement with retests and with an oculist's 
examination, has been presented by Eames (x). 

The Betts Telebinocular is an instrument for testing (a) the 
ability to fuse images created in the two eyes and (b) the ability to 
perceive depth. It consists of a pair of binocular lenses mounted 
on a stand and equipped with a staff or stand to which is attached 
a movable slide holder. The slides, placed at various positions along 
the stand, are viewed through the binoculars. From the pupil’s 
report of what he sees in the various slides, the examiner can 
determine such properties of the eyes as their visual acuity, muscular 
balance, ability to fuse images, astigmatism, and others. Some 
technical knowledge is required for the correct use and interpretation 
of the Telebinocular; this can be acquired only by working 
with the instrument itself. If the school system can afford to make 
the instrument available to teachers, they can easily learn to use it 
as a valuable supplement in evaluating vision.³ 


³ Information concerning the Telebinocular is available from the Keystone View 
Company, Meadville, Pa. 

Although only a small percentage of boys are color-blind, and a 
still smaller percentage of girls, pupils should be examined for 
color blindness because of its implications for educational and vocational 
guidance. Obviously, the color-blind boy should be discouraged 
from aspiring to any occupation, such as railroad engineering, 
where color blindness is a severe handicap. The Holmgren 
test for color blindness makes use of a standard test of various colored 
yarns which the pupil is required to group according to color. The 
Ishihara test for color blindness requires the pupil to read numbers 
formed by various colored dots on a background of colored dots. 
The special materials required for either of these tests can be obtained 
from any scientific supply company. 

5. The Ears.—A pupil’s hearing can be tested most accurately 
by the audiometer, an instrument developed by the Bell Telephone 
Laboratories and the Western Electric Company.⁴ It is obtainable 
in several different types, only some of which are adapted to testing 
hearing in school. The 4A audiometer, consisting of a phonograph 
and a telephone apparatus whose headset is clamped to the ear, 
enables the testing, one ear at a time, of any number of children up 
to forty at the same time. The testing room must, of course, be 
very quiet. The pupil is required to write on a special form the 
numbers dictated to him by the phonograph through the earphone. 
The completed forms, when compared with a key or standard, 
enable an estimate of the amount of hearing loss. For some chil- 
dren as many as four tests may be necessary to secure reliable 
results; children who cannot write must be tested in some other 
manner. 

The 5A audiometer, for individual testing, has a similar earphone 
but requires the pupil to push a button that lights a light when he 
no longer hears a buzz or hum, whose intensity the examiner 
regulates by turning a dial. The amount of hearing loss is read 
directly from the dial on the instrument. Advantages of the 5A 
are its simplicity, the short time (15-30 seconds) required to ad- 
minister the test, and the possibility of testing kindergarten children 
as readily as college students. Marked hearing defects for higher- 
pitched notes, however, are not always successfully detected with 
the 5A. 

When an audiometer is not available, cruder but valuable examina- 

* Information concerning audiometer equipment can be obtained from either the
Volta Bureau, Washington, D.C., or the American Society for the Hard of Hearing,
Washington, D.C.


272 How to Evaluate 


examinations and the advice that should accompany them are the 
major outcome of the teacher's work. 

In examining the teeth the teacher should first observe the pupil 
with jaws closed and lips drawn apart, so that the efficiency of the 
bite essential to mastication may be seen. If a large number of teeth 
do not meet each other in the normal bite, chewing may be seriously 
affected, with a resultant loss of efficiency in the digestion of food. 
The teeth should next be examined with the mouth wide open and
the head thrown back. It is possible for even the untrained to 
note inflamed or ulcerated gums, decayed teeth, and general oral 
uncleanliness. Furthermore, the teacher should be alert to note 
irregularities of the jaws or teeth which mar facial appearance and 
speech. Whatever defects are discovered should be noted in the 
pupil's cumulative records for guidance purposes, brought to the 
attention of his parents, with full emphasis upon the seriousness of 
dental defects, and wherever possible referred to a dentist for cor- 
rection. 

8. The Breath.—The pupil's breath can be noted during the
dental examination, as well as during daily contacts. Halitosis 
should not be for the teacher a case of “even your best friend won't 
tell you." Rather with all possible tact he should call to the pupil's 
attention the social seriousness of bad breath and the possibility that 
it may be symptomatic of some oral, nasal, or alimentary defect
of a serious nature. The cause and cure of persistent cases can be
determined only by a dentist or a physician, to whom the pupil 
should be referred. 

9. The Throat.—Inspection of the throat should be done with the 
pupil facing a good light, usually without the use of a tongue de- 
pressor; the pupil should open his mouth wide, extend his tongue, 
and utter a prolonged “ah!” If a tongue depressor is found neces- 
sary, care should be taken that it is disposed of immediately after 
use so as to prevent its being handled or used again. If possible, 
an attempt should be made to inspect the tonsils, which are on 
either side of the throat behind the end of the soft palate. Teachers 
should not attempt to recognize abnormalities of the tonsils except 
in cases of gross enlargement, abscess, or acute infection. In a 
diseased tonsil, the pits or “crypts” are filled with cheesy material 
and the tonsil seems to be covered with yellow spots. The size of
the tonsils, whether large or small, is not of itself significant. Other
symptoms for which teachers should be on the alert are a fiery red 
throat, with or without yellow spots on the tonsils, and also the
appearance of a small or large grayish patch on the tonsils or ad- 
jacent soft palate. Any of these conditions should be immediately 
brought to the attention of a physician. The health history of the 
pupil is of special significance here, frequent attacks of tonsillitis or 
rheumatism, huskiness of the voice, or hacking little coughs all 
being cause for serious concern. 

10. The Neck.—Enlarged lymph glands are the most frequent 
defect or disease in the region of the neck. They are detected by 
touch as lumps occurring in a slanting line from behind the angle 
of the jaw toward the junction of the breast and collar bones. 
Enlarged glands may indicate infection in the nose, mouth, or 
throat; decaying teeth, or possibly chronic infections, including 
tuberculosis. The pupil with enlarged lymph glands should receive 
immediate and thorough medical care. 

Goiter, or abnormal enlargement of the thyroid glands lying on 
either side of the windpipe at the base of the neck, is another
relatively frequent disturbance in the neck region. In some localities
where there is a deficiency of iodine in the drinking water or 
food, simple goiters are common. Since the seriousness of a goiter 
can be determined only by a physician, all cases of enlarged thyroid 
glands should be referred to a physician. 

11. The Chest.—Serious deformities of the chest are usually 
visible through the clothing. A projecting breastbone—known as 
pigeon breast—or flaring lower ribs may thus be noticeable. The 
ability to breathe deeply, the expansion of the chest equally on 
both sides during breathing, normal speed in ordinary breathing, 
and the presence of a chronic cough should all be the subject of the 
teacher’s close observation. Similarly, such symptoms of a badly 
working heart as abnormally fast breathing on slight exertion, as 
when walking up stairs or playing games, and a purplish color of 
the lips (cyanosis) can also be observed by the alert classroom 
teacher and referred to a physician. 

12. The Back.—Such abnormalities as a stoop or an angular 
projection of the back or other structural defects of the skeleton 
can be observed by teachers. Serious lateral curvatures can be
detected by looking at the pupil from behind when he stands with
feet together and both knees straight and noting whether one 
shoulder or one hip is higher than the other or whether one shoulder
blade stands out more than the other. The back, of course, may
be examined at the same time that posture is examined. 

13. The Legs and Feet.—Bowlegs, knock-knees, limping, limbs
of unequal length, and stiff joints should all be looked for;
examination for flat feet can be made by having the pupils remove
their shoes. The gait of pupils, the way in which they walk, may
indicate defects of the legs; a shuffling gait may indicate flat feet.
The fit of stockings and shoes, whether too large or too small,
should also be observed. Clubfoot, while relatively rare, is of
sufficient importance in the pupil's adjustment to be given special
attention in the cumulative record of his physical aspects.

14. Clothing.—Besides the shoes and stockings, the clothing 
should be evaluated for neatness, cleanliness, and suitability to 
weather conditions; it should be called to the attention of parents 
whenever found lacking in any of these respects. Teachers should 
also see that pupils do not wear overshoes indoors. 

15. Speech Defects and Abnormal Nervous Conditions.—Lisping, 
stuttering, involuntary twitchings, nervous tics, and similar phe- 
nomena can readily be observed by teachers. All such conditions 
should be brought to the attention of school psychologists, mental 
hygienists, or speech pathologists, since special attention can fre- 
quently remedy these handicaps to pupil adjustment. These defects 
should, of course, be carefully recorded and taken into account for 
guidance purposes. 

A summary of important points for observation has been drawn
up by Rogers (7) and is given below. This summary, with a
number of additions, may be used as a check list in the teacher's
observation of pupils’ physical aspects.

General:
  General impression of physique (age, race, and heredity taken
    into consideration)
  Vigor or weakness
  Alertness or listlessness
  Good or bad color
  Cleanliness or uncleanliness

Face and lips:
  Cleanliness
  Pallor
  Cyanosis or pallor of lips
  Flush of fever
  Ringworm, impetigo, or other disease

Hair and scalp:
  Cleanliness and neatness
  Signs of vermin or disease

Eyes and vision:
  Frequent errors in reading words or numbers
  Complaints of headache, pain, blurred vision
  Holding book too close
  Evidence of difficulty in seeing at a distance
  Congested eyes
  Red or crusted lids
  Test with Snellen letters
  Color-blindness tests

Ears and hearing:
  Dullness and slow response
  Presence of discharge from ear
  Special test with watch, voice, or audiometer

Nose:
  Inability to breathe freely with mouth closed

Throat:
  Signs of inflammation
  Diseased tonsils
  Obstructing tonsils
  History of frequent sore throat
  History of rheumatism

Teeth:
  Decayed permanent teeth
  Need of adjustment
  Diseased gums
  Uncleanliness

Neck:
  Enlarged lymphatic glands
  Enlarged thyroid glands
  Wry neck

Chest:
  Deformity
  Rapid breathing, especially after slight exertion

[Back:]
  Stoop
  Projection backward of spine
  Unequal height of shoulders
  Unequal height of hips
  Projection of one shoulder blade

Arms:
  Signs of scabies or ringworm
  Coldness or bluish appearance

Legs:

Use of the Clinical Thermometer.—Whenever a school is not
staffed with a nurse or physician, the teacher should know how to
use and interpret a clinical thermometer for the purpose of
determining whether or not a pupil is running a fever. It is probably
safe to say that whenever a pupil has a temperature more than
half a degree above or below 98.6° Fahrenheit, he should receive
the immediate attention of a physician. The thermometer should
always be washed with soap and cold water before and after use
and kept in a saturated solution of boric acid or some other
harmless antiseptic. The mercury in a clinical thermometer does not
fall unless it is shaken down. Before using the thermometer, the
teacher should always be sure that the top of the column of
mercury is well below the normal temperature point (about 98.6°).
If it is above this point, it should be shaken down by holding the
thermometer firmly at the end opposite the mercury bulb and
whipping it in the air several times. The bulb of the thermometer
should be placed under the pupil's tongue, and the lips (not the
teeth) should be kept closed upon it for twice the length of time
supposedly required for recording temperature. Thus, a
“one-minute” thermometer should be left in the mouth for at least two
minutes. When the pupil cannot breathe through his nose, it is not
worth while to take the temperature by mouth. Rather, the
thermometer should be placed in the pupil's armpit and the arm held
close at the side. One degree should be added to the reading when
the temperature is taken in this way.
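The two numerical rules in this paragraph (add one degree to an armpit reading; refer the pupil when the temperature lies more than half a degree from 98.6° Fahrenheit) can be set down as a short sketch. The function and constant names are illustrative assumptions, not part of the original text, and the sketch is a restatement of the rule as worded here rather than medical guidance.

```python
NORMAL_F = 98.6             # normal temperature point named in the text
TOLERANCE_F = 0.5           # "more than half a degree above or below"
ARMPIT_CORRECTION_F = 1.0   # one degree is added to an armpit reading


def corrected_reading(reading_f, site="mouth"):
    """Return the reading adjusted for the place where it was taken."""
    if site == "armpit":
        return reading_f + ARMPIT_CORRECTION_F
    if site == "mouth":
        return reading_f
    raise ValueError("site must be 'mouth' or 'armpit'")


def needs_physician(reading_f, site="mouth"):
    """True when the corrected reading is more than half a degree from normal."""
    deviation = abs(corrected_reading(reading_f, site) - NORMAL_F)
    # A small margin keeps floating-point noise from tripping the boundary case.
    return deviation > TOLERANCE_F + 1e-9
```

Under these assumptions a mouth reading of 99.4° would be flagged, while an armpit reading of 98.0° (corrected to 99.0°) would not.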


EDUCATIONAL GUIDANCE BASED ON THE EVALUATION OF
PHYSICAL ASPECTS


A list of the kinds of adjustments which may be made to the 
health needs of pupils was given in Chapter III. In general, the 
evaluation of the pupils’ physical aspects should be followed by the 
referral of cases needing attention to physicians and parents. Teach- 
ers must frequently serve as stimulators to overcome the inertia 
of parents in seeing that their children receive needed medical 
care. And after a physician has diagnosed and treated a physical 
defect, it is the responsibility of the teacher to see that the treatment 
is carried out in so far as she can control environmental conditions 
and pupil behavior in and outside the school. 

Organizing the Evaluation of Physical Aspects.—The daily in- 
spection or morning health review should become a classroom habit 
for both teachers and pupils. The teacher should see, during the 
first few minutes after school opens in the morning, whether each 
pupil seems up to what she has learned is his usual condition. 
Any pupil who deviates markedly from what is normal should be 
noted and given special attention, or "inattention," during class
work. Anyone who shows signs of an acute or communicable disease
should immediately be isolated until a decision is reached about
his remaining in school for the day. 

A more systematic and complete physical examination should 
be made periodically during the school year, say at the beginning, 
middle, and end. The findings concerning each pupil should be 
entered in a permanent cumulative record. Record forms for this 
purpose have been devised by the American Medical Association, 
whose Health Inspection Form for Children is shown in Fig. 5. 


COMMUNICABLE DISEASES 


Symptoms and Periods of Communicability.—Most of the com- 
mon communicable diseases are readily transmitted soon after they 
set in; hence the teacher can perform a valuable service in prevent- 
ing their spread by detecting pupils with acute illnesses as soon as 
possible. Once she has become familiar with the normal appearance 
of her pupils, it is not difficult to recognize such deviations from 
the normal as indicate the onset of sickness. Some of the general 
signs of disease are: listlessness, weakness, drowsiness, flushed face, 
undue pallor, headache, sneezing, running nose, red and watering 
eyes, coughing, vomiting, or sore throat; also, an eruption on the 
face, neck, or arms indicates disease. Teachers will find useful for 
reference purposes a list of the early symptoms of some of the 
common diseases which attack school children. The appearance of 
these symptoms in any pupil should be sufficient cause to have that 
pupil sent or taken home. The symptoms of the common diseases 
for which the teacher should be on the alert are as follows: 

Measles.—Cold in the head, with sneezing, running nose, red and
watering eyes, cough, fever. The eruption does not appear until
the third day.

Scarlet Fever.—Vomiting, sore throat, fever; a fine scarlet rash
appears within 24 hours on the neck, chest, arms, and, to some extent,
the face.

Diphtheria.—General signs of illness. There may be vomiting or a
chill or only great prostration. The throat may be red and there
may be a patch of gray membrane. The child may complain of
sore throat. Fever is present, though it is usually not high.
A watery nasal discharge which irritates the upper lips should
make one suspect nasal diphtheria during an epidemic.

[Fig. 5.—Facsimile of the Health Inspection Form for Children. Code:
O, no apparent defect; X, defect which should be seen by a physician.
Legible item headings: Deportment; Posture and gait; Muscles; Nose;
Mouth and gums; Teeth; Tonsils; Throat; Adenoids; Glands of neck;
Thyroid; Ears; Chest; Spine; Extremities; Feet; General condition;
Vaccination.]

Tonsillitis.—The throat is sore; there may be a chill or chilly
sensations and usually high fever. There is great prostration. The
throat is much inflamed, and yellowish spots may be present on 
the tonsils. 

Smallpox.—Chill, fever, backache, headache, nausea, and vomiting 
are usually present. The eruption appears on the second or third 
day. The symptoms may be very mild and the disease may be 
difficult to distinguish from chickenpox. 

Chickenpox.—An eruption of discrete, red, raised spots appears 
usually first on the forehead. There may be fever, but other 
symptoms are slight. 

Mumps.—There is swelling of the parotid gland, in front of and
below the ear, on one or both sides; this region is painful, espe- 
cially when swallowing; and there are general signs of illness. 

German Measles.—The symptoms are similar to those of measles
but are mild. In about 50 per cent of the cases there is no fever, the 
first sign of the disease being the eruption which appears first on 
the face and consists of discrete spots of a deep pink color. 

In case of an epidemic of any disease, the teacher should, of course,
become familiar with the symptoms of that disease.


QUESTIONS 


1. Enumerate some of the differences you would expect in evaluating 
physical aspects in the primary grades, in junior high school, and 
in senior high school. 

2. Make a report on the results of an attempt on your part to evaluate 
the physical aspects and need for medical attention of your neigh- 
bors or passers-by in the street. 

3. Should high school teachers who have a great many pupils for just 
one hour a day evaluate the physical aspects of those pupils in the 
same way as they evaluate the physical aspects of pupils in their home 
rooms? Explain. 

4. Will the manifestations of physical defects be more observable in 
some school situations than in others? Illustrate with reference to 
reading, music, lectures, slide illustrations, free study periods, social 
occasions, and gymnasium periods, and in terms of specific physical 
aspects of pupils. 


5. Illustrate some of the implications for parent-teacher relationships of
the evaluation of physical aspects of pupils. How must the teacher's 
work in this field fit into the socio-economic structure of the com- 
munity? 


REFERENCES 


1. Eames, T. H., “The reliability and validity of the Eames Eye Test,”
Journal of Educational Research, 33 : 524-527 (1940).

2. Indiana State Board of Health, Aids to the Teacher and Pupil in
Health Promotion, Indianapolis, 1941.

3. Jenss, Rachel M., and Souther, Susan P., Methods of Assessing the
Physical Fitness of Children, Washington: U.S. Department of
Labor, Children’s Bureau Publication No. 263, 1940.

4. Joint Committee on Health Problems in Education of the National
Education Association and the American Medical Association,
Conserving the Sight of School Children, New York: National Society
for the Prevention of Blindness, Publication 6, 1935.

5. Klein, A., and Thomas, L. C., Posture Exercises, a Handbook for
Schools and for Teachers of Physical Education, Washington: U.S.
Department of Labor, Children’s Bureau Publication No. 165, 1926.

6. Klein, A., and Thomas, L. C., Posture and Physical Fitness,
Washington: U.S. Department of Labor, Children’s Bureau Publication
No. 205, 1931.

7. Rogers, J. F., What Every Teacher Should Know About the Physical
Condition of Her Pupils, U.S. Department of the Interior, Office
of Education, Pamphlet 68, Washington: Government Printing
Office, 1936.


CHAPTER XIV 


General Mental Abilities 


NOWADAYS ALMOST EVERYONE KNOWS THAT THE WAY TO EVALUATE THE
mental abilities of pupils is by means of intelligence tests. It is also
common knowledge that the necessary tests are distinctly technical
instruments which must be purchased from their authors or
publishers, who develop them by means of highly refined statistical,
psychological, and sociological procedures. The measurement of
intelligence is the most highly developed of all the fields of
evaluation. There is consequently little occasion or need for the classroom
teacher to construct his own evaluation devices for the measurement
of mental ability. The question raised in connection with
achievement evaluation, “Should the evaluation device be an externally
made, standardized test or an informal, teacher-made test?” may
therefore be immediately answered, in the field of mental ability, in
favor of the standardized, purchasable test. The major practical
question thus reduces itself to the problem of choosing a test from
among the many available.

The present chapter will seek to furnish the means for solving
this problem while at the same time providing a general
understanding of the construction of intelligence tests. Such an understanding
is, of course, essential to the wise use and interpretation of measure- 
ments of intelligence. Thus, instead of the teacher's being concerned 
with the first two of the five major steps of evaluation (see pages 
121-123) as part of his own practice and procedure, he is here con- 
cerned with the purpose and content of intelligence measurements 
and with the technique of constructing an intelligence test merely 
as part of the theoretical basis and background for selecting a test 
and then carrying out the next three steps: administration, inter- 
pretation, and evaluation of the device. 



PURPOSE OF INTELLIGENCE TESTS

Accordingly, this discussion must begin with a definition of the
purpose of the evaluation of mental ability, as the first problem
confronted by anyone undertaking to construct an intelligence test.
The discussion in Chapter IV has already supplied the major part
of the answer to this problem. We saw there, however, that the most
meaningful definition of intelligence for practical purposes emerges
only when we perform operations or construct tests to measure it.
Let us turn, therefore, to the intelligence test builders, to see what
operations they performed in constructing their tests. What were
their purposes? How did they select their test content to achieve
these purposes?

The first successful intelligence tester, Alfred Binet, was seeking a
technique with which to separate the school children of Paris into
those who were mentally fit and those who were unfit to be taught
in the regular Parisian public schools. After great effort in many
different directions, such as head measurements, physiognomy,
graphology, palmistry, and sensory discriminations, Binet finally
achieved a somewhat successful approach in terms of the “higher
mental processes.” The point here, however, is that his criterion of
a test’s validity (to use contemporary terminology) was ability to
succeed in school work. The content of his measuring instruments
was derived not immediately from an a priori, armchair reasoning,
but rather by an almost random process of trial and error
toward a practical goal. This criterion, “ability to succeed in school”
or “general scholastic aptitude,” has pervaded most subsequent
operational definitions of intelligence. Whatever may be the fate
of such general definitions as were noted in Chapter IV or of
theories of the organization or determiners of intelligence, ability
to succeed in school must always remain a major part of our
conception of intelligence, at least in our present civilization. Ever
since Binet, teachers’ judgments of pupil intelligence and teachers’
marks, all based upon judgments of the pupil's success in dealing
with school subjects, have been important criteria of the validity
of measurements of intelligence.

The question may be raised, “If school success and teachers’
judgments of intelligence are so important a criterion of intelligence
tests, then why do we need special instruments?” The answer is
that teachers’ unaided judgments are not sufficiently reliable, valid, 
or in many practical situations administrable. Binet investigated the 
ways in which teachers judge the intelligence of their pupils both 
by using a questionnaire and by asking three teachers to come to 
his laboratory to judge the intelligence of children whom they had 
never seen before. He found that teachers “do not have a very 
definite idea of what constitutes intelligence, . . . tend to confuse it 
variously with capacity for memorizing, facility in reading, ability
to master arithmetic, . . . fail to appreciate the one-sidedness of the
school's demands upon intelligence, . . . are too easily deceived by
a spritely attitude, a sympathetic expression, a glance of the eye, or
a chance ‘bump’ on the head, . . . show rather undue confidence
in the accuracy of their judgments.” His observations of the teach- 
ers’ actual procedures indicated that each one attempted to con- 
struct on the spot a poorly thought-out, unstandardized, errorful 
little intelligence test which resulted in little agreement among the 
estimates of all three. “The teachers employed very awkwardly a 
very excellent method” (11 : 28-35). The intelligence test method of
evaluating mental ability thus emerges as merely a refinement, 
organization, and standardization in scientific form of the teacher's 
or layman's “common-sense” but error-ridden approach. 


CRITERIA FOR CONTENT OF INTELLIGENCE TESTS


Let us now examine the criteria employed by Binet and subsequent
intelligence test builders in the selection and arrangement of
their test content. Every builder must select his content according
to some conception of intelligence, either explicitly formulated or 
implicitly assumed. Binet’s conception stressed such characteristics 
of thinking as its power to keep going along purposeful lines, to 
criticize itself, and to adapt itself to a specific goal. The manifesta- 
tions of this conception took many varied forms in test content: 
tests of time orientation, several kinds of memory, language
comprehension, knowledge about common objects, free association,
number mastery, constructive imagination, and ability to compare
concepts, to see contradictions, to bind fragments into a unitary whole,
to comprehend abstract terms, and to meet novel situations (11 : 46).

In addition to its agreement with some conception of intelligence,
test content is subjected to a second criterion—equal opportunity
of human beings within a given culture to have experience with 
that content. In setting up this requirement intelligence test build- 
ers have hoped to hold constant the factor of environment or 
learning so that the differences between individuals that emerge 
from the testing will be ascribable to innate hereditary factors. To 
a large extent this attempt is justifiable, especially in so far as it 
excludes test content which is explicitly taught in the schools. Thus 
it would be folly to use shorthand writing ability to measure high 
school students’ intelligence, since learning opportunity or environ- 
mental influences vary so greatly from pupil to pupil with respect to 
such test content. On the other hand, test builders and users must 
realize that this requirement—equal opportunity to learn test
content—is seldom even closely realized. It is this unwarranted
assumption, that intelligence tests cover only what everybody has had an
equal opportunity to learn, that is responsible for much confusion 
in the nature-nurture controversy and for much unwarranted 
fatalism and determinism in the interpretation of test results. 

To rule out differences in the previous experiences or training 
of pupils, the attempt is made to use either equally familiar or 
equally unfamiliar material. Equally familiar material may be il- 
lustrated by such tests as asking a child to point to his nose, to 
draw a picture of a man, or to complete a picture in which one 
of the shadows is missing. Equally unfamiliar material may be 
illustrated by tests in which the individual is required to learn 
and apply an artificial language composed of nonsense syllables, 
or by maze tests. Test material in which environmental factors 
probably play the major role may be illustrated by vocabulary tests, 
arithmetic computation tests, and any other test involving words or 
mathematical symbols. This latter kind of material, obviously in- 
fluenced to a large extent by learning opportunities and by the 
types of activity in which the individual is constrained to engage 
by the culture in which he lives, constitutes by far the major pro- 
portion of the content of most intelligence tests. This fact, while it 
may invalidate the tests as measures of innate hereditary ability, 
whatever that may mean, does not of course detract from their 
usefulness in guidance. Whether or not the ability reflected in a 
pupil’s score is determined by his genes or by his environment,
that ability must still be taken into account in making vocational
plans or in explaining school success or failure. 

An effort is always made in selecting intelligence test content to 
avoid material which is explicitly presented to some groups in a 
given culture while withheld from other groups in the same culture, 
as for example the kind of material included in any specific achieve- 
ment test. The test aims rather to make a general sampling of the 
mental processes and information which are prevalent throughout, 
or as common as possible in, a given culture, such as that of the
United States at the present time. Inevitably, however, such a 
sampling will have a great deal in common with what is taught in 
our schools.

This is seen from the high correlations which are always ob- 
tained between a test of general intelligence and a battery of general 
achievement tests. T. L. Kelley (9), among others, has shown that 
this is the case. He obtained correlation coefficients, corrected for
the unreliability of the tests correlated, which enabled him to 
estimate the community of function between intelligence and 
achievement tests. He concluded (9 :208) that the community of 
function among different intelligence tests is about 95 per cent; 
between intelligence tests and achievement batteries, about 90 per
cent; between intelligence tests and reading tests, about 92 per cent; 
between intelligence tests and arithmetic tests, about 88 per cent. 
Although these figures are based on grade school pupils, similar 
results between achievement and intelligence test scores would 
probably be obtained at any other age level. Kelley derived from 
these data his point of view that the distinction between intelligence 
and achievement tests is largely spurious and is due to the “jangle” 
fallacy, which leads us to believe that two separate words or
expressions for the same thing necessarily carry a real distinction
between them. 
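A correlation "corrected for the unreliability of the tests correlated" is ordinarily obtained with Spearman's correction for attenuation, in which the observed correlation is divided by the square root of the product of the two reliabilities. The sketch below assumes that standard formula; the passage does not state the exact computation Kelley used, nor how he converted the result into a percentage of community of function, and the figures in the example are invented.

```python
import math


def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between two tests if both were perfectly reliable.

    r_xy is the observed correlation between the tests; r_xx and r_yy are
    the reliability coefficients of the first and second test.
    """
    if not (0 < r_xx <= 1 and 0 < r_yy <= 1):
        raise ValueError("reliabilities must lie between 0 and 1")
    return r_xy / math.sqrt(r_xx * r_yy)


# Invented figures: observed r = .63, reliabilities .90 and .80.
estimate = correct_for_attenuation(0.63, 0.90, 0.80)
```

With perfectly reliable tests (both reliabilities equal to 1) the correction leaves the observed correlation unchanged, which is a convenient check on the formula.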

If intelligence and achievement tests do cover the same ground
to such a large extent, why do we need intelligence tests? The
answer must be given in terms of practical usability and
convenience. General scholastic achievement cannot usually be
measured in as brief a time and with as short a test as can intelligence.
Achievement in a single subject is usually measured by a test that is
as long as the average intelligence test; to obtain a measure of
general achievement would require several such achievement tests.
Similarly, for pupils with varying educational background—from
different school systems, for example—it is difficult to select an
achievement test that is not prejudicial to some of them.

Thus Ruch and Segel (10 : 24) state that verbal group intelligence
tests are of most value when (1) they are used with achievement
tests for pupils whose school attendance has been irregular,
(2) transfers from other schools must be given tentative class
assignments without delay even though adequate cumulative records are
not available, (3) financial and other facilities permit extensive
testing programs including both educational and mental tests, and
(4) mental tests provide diagnoses of differential mental abilities
such as memory, attention span, symbolic thinking, and so forth.

Another advantage claimed for the intelligence test over general
achievement test batteries as measures of scholastic aptitude is that its
novelty and apparent unrelatedness to ordinary school subject matter
create greater interest and better testing morale on the part of
students who may have antipathies toward one or more school
subjects. A child may show low achievement because of neuroticism,
boredom, poor teaching, or other non-intellectual reasons. The
mental test may be able to reveal abilities which have escaped the
influence of such debilitating factors.
Kelley uses his argument of the community between intelligence
tests and achievement tests mainly to inveigh against the accomplishment
quotient (A.Q.), “achievement age divided by mental age.”
However, he does admit the possibility of an individual's differing
in intelligence and achievement and the need of intelligence tests
to reveal such differences. Especially if the achievement in which we
are interested is not general but special (for example, music, computation,
spelling, handwriting, etc.), then the difference between it
and intelligence may easily be real and one which is demonstrable
by means of an intelligence test and a measure of the special achievement
in question.
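
The accomplishment quotient is simple arithmetic, and a short sketch may make the definition concrete. The Python below is only an illustration: the function name and the sample ages are hypothetical, and the quotient is left as a plain ratio because the text defines it only as achievement age divided by mental age.

```python
def accomplishment_quotient(achievement_age: float, mental_age: float) -> float:
    """Return achievement age divided by mental age (the A.Q. as defined in the text)."""
    if mental_age <= 0:
        raise ValueError("mental_age must be positive")
    return achievement_age / mental_age


# Hypothetical pupil: achievement age 9.0 years, mental age 10.0 years.
aq = accomplishment_quotient(9.0, 10.0)
print(round(aq, 2))  # 0.9
```

A quotient below 1 would be read as achievement falling short of what the mental age suggests; the text reports Kelley's objection to leaning on this ratio.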
Let us now return to the discussion of the criteria used in selecting
the content of intelligence tests. Having chosen and constructed a
collection of test situations, Binet tried them out on normal children
of different ages. Then he applied a technique of evaluating the test
material in terms of its difficulty and validity which served to insure


288 How to Evaluate 


even more the success of his new type of intelligence test content: he 
determined the percentage of success in a given test situation for 
each age group of children tested. If this percentage increased little 
or not at all or decreased in going from younger to older children, the 
test was declared invalid on the justifiable assumption that the 
average intelligence of normal children (that is, the mental age) 
increases with chronological age. If, however, the percentage of
success increased rapidly with age and, furthermore, increased within
a given age group in accordance with elsewhere-obtained judgments
of the intelligence of the children in that group, the test situation was
judged valid and retained for the final form of the test. The age level
at which 66 to 75 per cent of the children passed a test was used to
define the difficulty of that test situation in terms of its age level. 
This way of defining difficulty was useful, as we shall see later, in 
interpreting the scores of pupils. 
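
The age criterion just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a reconstruction of Binet's actual computations: the function names, the 5-point minimum gain used to decide that a percentage "increased little or not at all," and the sample percentages are all hypothetical. Only the 66 to 75 per cent band comes from the text, and the second criterion (agreement with judgments of intelligence within an age group) is not modeled.

```python
from typing import Dict, Optional


def rises_with_age(pct_pass: Dict[int, float], min_gain: float = 5.0) -> bool:
    """True if the percentage passing rises by at least min_gain points
    from each age group to the next older one (min_gain is an assumed cutoff)."""
    ages = sorted(pct_pass)
    return all(pct_pass[b] - pct_pass[a] >= min_gain
               for a, b in zip(ages, ages[1:]))


def age_level(pct_pass: Dict[int, float],
              low: float = 66.0, high: float = 75.0) -> Optional[int]:
    """Youngest age whose percentage passing lies in the 66-75 band, else None."""
    for age in sorted(pct_pass):
        if low <= pct_pass[age] <= high:
            return age
    return None


# Hypothetical percentages passing one test situation at ages 6 to 10.
good_item = {6: 20.0, 7: 45.0, 8: 70.0, 9: 88.0, 10: 96.0}
flat_item = {6: 50.0, 7: 52.0, 8: 49.0, 9: 51.0}

print(rises_with_age(good_item), age_level(good_item))  # True 8
print(rises_with_age(flat_item))                        # False
```

In this sketch an item whose pass rate jumps over the band between two adjacent ages gets no age level (`None`); how such cases were actually handled is not stated in the text.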

Similar in principle to the criterion of an increasing percentage 
of success with increasing chronological age is the use of the total 
test score as a criterion for the validity of each separate item of test 
content. Using the total test score as the best available measure of 
intelligence, the test builder requires that each component of the 
test, each test situation or question, shall discriminate between those 
who make high, medium, and low total scores on the test. If the 
pupils are divided according to their total score into high, middle,
and low classes, then the percentage of success with a given item 
should rise in proceeding from the low to the high group. This is 
merely another way of saying that the correlation between single 
test items and total scores should be as high as possible. In practice 
this is achieved by any one of a great number of available techniques 
of statistical item analysis. 
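
One of the simplest forms of item analysis is the three-group comparison described above. The following Python sketch is a hedged illustration only: the function names, the split into equal thirds, and the sample scores are assumptions made here, and published item-analysis techniques differ in how the groups are formed.

```python
from typing import List, Tuple


def group_pass_rates(item: List[int], totals: List[float]) -> Tuple[float, float, float]:
    """Percentage passing one item in the low, middle, and high thirds
    of pupils ranked by total score (item entries are 1 = pass, 0 = fail)."""
    if len(item) != len(totals) or len(totals) < 3:
        raise ValueError("need matching lists for at least three pupils")
    order = sorted(range(len(totals)), key=lambda i: totals[i])
    third = len(order) // 3
    groups = (order[:third], order[third:len(order) - third], order[len(order) - third:])
    rates = [100.0 * sum(item[i] for i in g) / len(g) for g in groups]
    return rates[0], rates[1], rates[2]


def discriminates(item: List[int], totals: List[float]) -> bool:
    """True if the pass rate rises from the low to the middle to the high group."""
    low, mid, high = group_pass_rates(item, totals)
    return low < mid < high


# Hypothetical data for nine pupils: total scores and results on two items.
totals = [12, 15, 18, 22, 25, 27, 31, 34, 38]
strong_item = [0, 0, 1, 0, 1, 1, 1, 1, 1]
weak_item = [1, 0, 1, 0, 1, 0, 1, 0, 1]

print([round(r, 1) for r in group_pass_rates(strong_item, totals)])  # [33.3, 66.7, 100.0]
print(discriminates(strong_item, totals), discriminates(weak_item, totals))  # True False
```

The strong item would be retained under this rule; the weak item, passed about as often by low scorers as by high scorers, would be dropped.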

After the material has been selected according to its validity, 
attention is paid to other aspects of test content relating to its 
administrability and practicality. Such considerations as ease of 
scoring, appeal to the student, time requirements, material or apparatus
needed, and equal difficulty for both sexes are weighed in
the balance along with item validity. Large quantities of seemingly
acceptable test material are often rejected on one or more of these
grounds. Many provisional test forms are constructed and tried out
on appropriate samples of pupils and the results are critically
examined so that only the most satisfactory test materials will be
retained. 

Finally one or more final forms of the test are constructed and 
administered to large representative samples of pupils, the samples 
being selected so that proper numbers of pupils are drawn from all 
the groups according to factors which may have an effect on the 
test results. The standardization of the Revised Stanford-Binet test 
of intelligence thus took into account geographic distribution, socio- 
economic status, rural or urban residence, and nationality or descent 
of the pupils tested in each age range. At this point the selection 
and standardization of test content requires much highly skilled 
labor and statistical analysis and involves no little financial expense.
Here as nowhere else the individual teacher or school cannot com- 
pete with the standardized product of well-financed research pro- 
grams. The inadequacy of a teacher-made intelligence test thus 
becomes obvious. 
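
Drawing "proper numbers of pupils" from each group amounts to what is now usually called proportional stratified sampling. The Python sketch below shows one way such quotas might be computed. It is not the procedure used for the Revised Stanford-Binet: the function name, the strata, and the population counts are hypothetical, and the largest-remainder rounding is simply a convenient choice made here.

```python
from typing import Dict


def proportional_allocation(population: Dict[str, int], n: int) -> Dict[str, int]:
    """Split a sample of n pupils across strata in proportion to stratum size.

    Integer arithmetic is used throughout; leftover places go to the strata
    with the largest remainders so that the quotas always sum to n.
    """
    total = sum(population.values())
    counts = {k: n * v // total for k, v in population.items()}
    remainders = {k: n * v % total for k, v in population.items()}
    leftover = n - sum(counts.values())
    for k in sorted(remainders, key=remainders.get, reverse=True)[:leftover]:
        counts[k] += 1
    return counts


# Hypothetical school population by type of residence.
population = {"urban": 5200, "small town": 1900, "rural": 2900}
print(proportional_allocation(population, 300))
# {'urban': 156, 'small town': 57, 'rural': 87}
```

A real standardization would cross several factors at once (region, socio-economic status, residence, descent); the same allocation could be applied to each combined cell.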


TYPES OF INTELLIGENCE TESTS


Even more than other types of evaluation devices, intelligence tests 
are classifiable as individual or group according to whether they are 
administrable to only one or to more than one person at a time.
A prime example of an individual intelligence test is the Stanford 
Revision of the Binet-Simon Scale; the group test may be illustrated 
by such tests as the Terman-McNemar Test of Mental Ability, the 
Henmon-Nelson Test of Mental Ability, the American Council on 
Education Psychological Examination, and the Ohio State Uni- 
versity Psychological Test. The chief disadvantage of individual 
testing is, of course, the large amount of time required and the 
consequent high cost. Furthermore, the skill needed for the proper 
administration of an individual intelligence test is usually gained 
only in a one-semester course in psychology at the university level,
preceded in most universities by two or more introductory courses
in psychology. 

The advantages usually cited in favor of individual intelligence 
testing are its greater reliability and validity, and its interpretability 
in clinical terms. The first two—higher reliability and validity—are 
largely illusory, for most available group tests have reliabilities higher
than .85 and correlations with such criteria as school marks or
teachers’ ratings ranging upward from .40. Neither of these figures
is sufficiently, if at all, exceeded by individual intelligence tests to
warrant the selection of the latter on these grounds. The other advantage,
interpretability in clinical terms, refers to the possibility
of observing more carefully the behavior of the pupil during the
test, and of making judgments concerning his emotional adjustment,
motivation, working habits, physical aspects, special interests,
and other aspects in addition to his general intelligence. Furthermore,
the individual test affords an opportunity for eliminating
errors of measurement due to faulty understanding of directions,
lack of interest, or fatigue. These factors may seriously invalidate
group tests when given to children below eight years of age, to
mentally defective children, or to badly maladjusted children. Bingham
(2 : 228) has noted this advantage even with adults: “. . . distraught
persons who really required professional psychiatric service
were less often identified as such by the counselors who talked with
them intimately about their problems than by the examiners who
administered the individual performance tests and so had a chance
to observe them under exceptionally revealing circumstances.”


The advantages of individual tests can be greatly enhanced by
the use of the excellent Examiner's Check List for Use in Noting and
Interpreting Behavior During the Test Period, constructed by
F. Baumgarten and translated from the German by F. J. Keller
(2 : 229-235). This list presents points of behavior to be noted during
the preliminary instructions and the execution of the task, the subject’s
attitude toward his performance, and his conduct at the end
of the test and after the testing. The detailed and comprehensive
nature of the list and the psychological insight evident in the suggested
inferences or interpretations offered for each item of behavior
noted make it a valuable adjunct to any program of individual testing.

Whatever advantages the individual intelligence test may have for
testing very young children and for enabling clinical observation, the
fact remains that it is impractical for most classroom teachers in the
vast majority of situations. Since most of the evaluation of mental
abilities can be made satisfactorily by means of group tests and since
an adequate discussion of individual tests requires more space for
each test than the present chapter can afford, only group tests will
be discussed. The reader is referred to books specializing in individual
testing (3, 5, 6, 12).




Another distinction between intelligence tests is that between 
performance and verbal tests. The latter, which require that the
pupil be able to use and understand the English language, com- 
prise by far the majority of group intelligence tests. Performance 
tests enable the measurement of pupils who are unable to read or 
write English, such as pre-school children, foreign-born children, 
children with speech defects, deaf children, and others whose use of 
language is limited. Sometimes a further distinction is made between 
performance and non-language tests. The former require the sub- 
ject to do something with concrete materials, such as form boards, 
blocks, or picture puzzles; the latter demand merely a minimum of 
written or spoken language and involve no concrete material to be 
manipulated. Performance tests are usually (but not
necessarily or always) also individual tests, while non-language tests
are usually designed for group administration.


TYPES OF MATERIAL INCLUDED IN INTELLIGENCE TESTS


We have space here for only a few illustrations of the many dif- 
ferent types of material which have been included in intelligence 
tests. The number of different kinds of material which may be con- 
ceived of as requiring intelligence is probably nearly infinite. One 
indication of the great variety of tests which are considered to elicit 
intelligent mental activity may be obtained from Thurstone’s attempt 
(13), by the statistical method of factor analysis, to simplify it into 
a smaller number of “primary mental abilities.” He used a battery 
of 56 psychological tests chosen to represent a fairly wide range of 
the mental activities that are typical of current tests of general intelligence.
In the physical world the number of forms which matter
can take is extremely large, although each of these forms consists of 
one or more of only 92 chemical elements. Thurstone’s attempt may 
be likened to the chemist's attempt to break down a large number 
of compounds into a much smaller number of elements. We are 
concerned with his attempt only in so far as it indicates the tre- 
mendous variety of intelligence test material, for Thurstone's bat- 
tery of 56 tests was itself merely a small sample of possible in- 
telligence tests. 

In selecting materials to illustrate those included in intelligence 
tests we shall draw from actual tests from which teachers may select 
one or more to meet their own specific needs. That is, the following
discussion will serve not only to illustrate test materials but also to
provide data relevant to decisions concerning what test to use. The 
factors to be considered in selecting a standardized intelligence test 
are similar to those pertinent to the selection of standardized achieve- 
ment tests (see Chapter X). Consequently, in discussing available 
intelligence tests we shall be concerned with (1) validity, (2) relia- 
bility, (3) administrability, and (4) interpretability. Only a limited 
number of the many available tests are discussed; these have been 
selected as probably most serviceable in meeting specific needs. It is 
impossible to prescribe a single test as most satisfactory for all situa- 
tions. Literally hundreds of authors have attempted to construct 
and standardize mental ability and scholastic aptitude tests. Of the
type of tests we are discussing, Hildreth (8) lists 121 for elementary 
school children, 39 for high school students, and 79 for college and 
adult levels. 


ILLUSTRATIONS OF SPECIFIC TYPES OF MENTAL TEST CONTENT


1. Common Observation. (From the Pintner-Cunningham Pri- 
mary Tests: Form A.) 
Oral directions (given by test administrator):

Fig. 6. Test material from Pintner-Cunningham Primary Test. (Page 2, Test 1; picture not reproduced.)




Look at the pictures at the top of the page. (Indicate the first row,
page two.) We are going to put some marks on some of the pictures, 
but not on all of them. Let's mark the things that mother uses when she 
sews her apron. Can you find the scissors? Put your finger on it. Mark it 
like this. (Draw line on board.) Mark just the scissors and not anything 
else. (See that all have followed directions. Any kind of mark that indicates
the object is satisfactory.) Now we are going to look for something
else that she would need in sewing her apron. She would not need
a coffee pot. We must not mark it. Would she need a thimble? Yes,
mark it. (Pause.) She would not need a kettle. Do not mark the kettle. 
There are two more things that she would need to sew her apron. Find 
them and mark them. Be careful—mark just the right things. (Allow a 
reasonable time.) Now look at the rest of the pictures on this page. These 
pictures show some things that have feathers and some that do not have 
feathers. Find all the things that have feathers, and mark them. Go 
ahead! (Time, 30 seconds.) 


Other Pintner-Cunningham tests, using similar pictorial material, 
are entitled aesthetic differences, associated objects, discrimination of 
size, picture parts, picture completion, and dot drawing. This material
illustrates the avoidance of words and numbers in tests designed
for pupils of the third grade in school.

2. Vocabulary. (From the California Test of Mental Maturity— 
Intermediate Series.) 


Directions: Draw a line under the word which means the same or
about the same as the first word. Write the number of this word on the
line to the right, as:


 1. strange    1. real     2. tell      3. certain   4. unknown   ____  1
 2. reply      1. news     2. answer    3. note      4. open      ____  2
49. odium      1. favor    2. blame     3. smell     4. poem      ____ 49
50. chuff      1. peeve    2. churl     3. cliff     4. laugh     ____ 50


This type of vocabulary test of appropriate degrees of difficulty 
occurs in most group mental tests from the fifth grade to the university
adult level. This kind of verbal material is, of course, greatly
influenced by learning opportunities and environmental stimulation.
It is obvious that whatever intelligence is reflected in success with
such material is influenced by much else besides hereditary, innate
factors.




Of the sixteen subtests in the California Test of Mental Maturity,
the first three are tests of visual acuity, auditory acuity, and motor
coordination designed to enable the detection of those whose physical
defects in these respects would prevent them from performing on
the other subtests to the full extent of their intellectual abilities. Subtests
four and five, immediate recall and delayed recall, are grouped
under the heading Memory. The next three subtests, sensing right
and left, manipulation of areas, and foresight in spatial situations,
are grouped under the heading Spatial Relationships. The next seven
tests, namely opposites, similarities, analogies, number series, numerical
quantity (non-language), numerical quantity (language), and inference,
are grouped under the heading Reasoning. Subtest sixteen [...]
factors which, the authors of the test contend, are more helpful in
diagnosis and guidance than a single total score.


3. Information. (From the Terman-McNemar Test of Mental 
Ability, Form C.)


Directions: Mark the answer space* which has the same number as the
word that makes the sentence TRUE.

SAMPLE. Our first President was
    1 Adams    2 Washington    3 Lincoln    4 Jefferson    5 Monroe

 1. Polo is a kind of
    1 disease    2 work    3 bear    4 game    5 language

 2. Herring is a kind of
    6 wig    7 flower    8 pattern    9 jewel    10 fish

19. Emeralds are usually
    1 red    2 yellow    3 green    4 purple    5 blue

20. Sirloin is a cut of
    6 mutton    7 beef    8 veal    9 lamb    10 pork


* Five answer spaces are provided for machine or window stencil scoring.




This material, like the vocabulary test, is so similar in nature to a 
generalized achievement test as to indicate again the difficulty of 
distinguishing between intelligence and achievement. This similarity 
provides a non-statistical basis for confidence in Kelley's finding of 
almost perfect correlation between general scholastic achievement 
tests and general intelligence tests. 

4. Disarranged Sentences. (From the Henmon-Nelson Test of 
Mental Ability, Grades 7-12, Form B.) 


15. staff bread the life called is of. If these words were arranged
    to make a good sentence, what would be the word after
    "of"? 1. staff 2. bread 3. the 4. life 5. called

42. substance made a bricks called and clay are from pottery.
    If these words were arranged to make a good sentence,
    what would be the word after "substance"? 1. bricks 2. clay
    3. called 4. are 5. pottery


The material of the Henmon-Nelson Test is not segregated but 
is arranged in a "scrambled" sequence, with the different types of 
test items following one another in apparently random order but in- 
creasing in difficulty. The wide variety of items includes vocabulary, 
sentence completion, disarranged sentences, classification, logical 
selection, series completion, directions, analogies, anagrams, proverb 
interpretations, and arithmetic problems. This arrangement has the 
advantage of simple administration, because the examiner need not 
start and stop the pupils for each subtest, as is necessary when the
different types of material are segregated and timed. 

5. Verbal Analogies. (From the Ohio State University Psycho- 
logical Test, Form 21.) 


Directions: Find among the five numbered choices a word that fits the 
third word in the same way that the second word fits the first word. 


1. boy boyish infant          1. childish 2. infantlike 3. infant 4. infantile 5. infantil

2. sight see eulogy           1. eulogy 2. eulogue 3. eulogize 4. eulogies 5. eulogie

3. virtuous virtue demented   1. demention 2. dement 3. demented 4. demente 5. dementia




This type of verbal test material requires not only a knowledge 
of vocabulary and spelling but also the ability to see verbal relationships.
The other two types of test content in the Ohio State University
Psychological Test are same-opposite multiple-choice vocabulary
items and paragraph reading, in which the student is required
to read a series of paragraphs and then answer after each paragraph
a series of multiple-choice questions based on it. There are no time
limits to the test so that every pupil has a chance to attempt every
item. It should be noted here that, according to Toops (14 : 2287),
“with but few, if any, exceptions, this test has been under continuous
improvement longer and has had more research expended upon it
than any other intelligence test.” Its scoring has been greatly facilitated
by providing answer pads in which the pupil punches his
answer with a stylus, also provided, and thereby marks three answer
sheets at a time. Scoring can be done and checked independently by
several of the pupils themselves.

6. Spatial Analogies. (From the Psychological Examination by 
L. L. Thurstone and T. G. Thurstone, 1931 edition.) 


Directions: Underline that figure which has the same relation to the 
third figure as the second has to the first, 


[Two rows of figure-analogy items; figures not reproduced.]


7. Number Series. (From American Council on Education 
Psychological Examination for College Freshmen, 1938 Edition.) 
Directions: The numbers in each row follow one another according
to some rule. You are to find the rule for each line of numbers. Some
missing numbers are indicated by blanks. The other missing number
is indicated by the X at the end of the series. It is the value of X in each
row which you are to mark.

    15  __  25  30  __  40  45  X
    98  87  76  65  __  __  X
    [third row illegible]
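
To show the kind of reasoning such an item calls for, here is a small Python sketch that solves a row when the rule is a constant difference between neighboring terms. It is an illustration only: the function name is hypothetical, real items may follow other rules that this sketch does not handle, and the two sample rows read each dash in the printed rows as a blank.

```python
from fractions import Fraction
from typing import List, Optional


def solve_series(terms: List[Optional[int]]) -> Fraction:
    """Return the final term (X), assuming a constant difference between terms.

    Blanks and the final X are given as None. Only constant-difference
    rules are handled; anything else raises ValueError.
    """
    known = [(i, v) for i, v in enumerate(terms) if v is not None]
    if len(known) < 2:
        raise ValueError("need at least two known terms")
    (i0, v0), (i1, v1) = known[0], known[1]
    step = Fraction(v1 - v0, i1 - i0)
    # Every known term must fit the same rule.
    for i, v in known:
        if v0 + step * (i - i0) != v:
            raise ValueError("terms do not follow a constant-difference rule")
    return v0 + step * (len(terms) - 1 - i0)


row_1 = [15, None, 25, 30, None, 40, 45, None]   # last None stands for X
row_2 = [98, 87, 76, 65, None, None, None]

print(solve_series(row_1))  # 50
print(solve_series(row_2))  # 32
```

The pupil, of course, must discover the rule unaided; the sketch merely checks that one simple rule accounts for every printed number.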




This test obviously measures some form of mathematical ability. 
The other subtests of the American Council Psychological Examina- 
tion are arithmetic, analogies, completion, artificial language, and 
same or opposite vocabulary. The examination provides two scores, a 
linguistic, or L-, score based on the composite of the same-opposite, 
completion, and verbal analogies test scores; and a quantitative, or 
Q-, score derived from the arithmetic, number series, and figure
analogies test scores. The linguistic score is weighted roughly twice as 
much as the quantitative score since, in the test authors’ view, a 
scholastic aptitude test should be rather heavily “saturated” with 
language factors. This twofold analysis of mental ability is similar 
in purpose to that provided by the California Tests of Mental 
Maturity. These two tests are typical of the current trend toward
breaking down mental ability into separate factors, at least into the 
verbal and the mathematical. This breakdown, in line with the 
“gravel” theory of the structure of intelligence (see Chapter IV), 
will perhaps provide the basis for a more specific guidance of 
pupils than is possible when intelligence is viewed and measured 
as a general whole. 

In Table 7 are presented certain facts concerning a selection of 
group mental ability tests which, at the present time, represent some 
of the best ones available. Much better tests than these may appear in 
the future. 

Teachers are urged to be on the alert for improved tests. Here we 
can merely provide teachers with the general equipment for selecting
and evaluating tests, with specific mention of some good contemporaneous
ones. No data concerning the validity and reliability
of the tests are given in the present discussion since, beyond the
general assurance that all the tests have been found to be as highly
valid and reliable as any available, their validity and reliability in any
specific school situation must be determined specifically. Similarly,
the interpretability of these tests is uniformly good, both in the types
of norms provided and with respect to the size and nature of the
groups of pupils on which these norms are based.

The chief respect in which these tests differ among themselves is 
in their administrability, that is, the grades or ages to which they 
are applicable, the cost, and the time required to administer them. 
Teachers should select the test which meets their specific needs
with respect to these aspects. [...] the cost of a test may frequently [...]


TABLE 7. GENERAL MENTAL ABILITY TESTS

Name of Test                                Grades Designed for       No. of Forms            Time (min.)  Publisher No.
                                                                                              Required     (see list on
                                                                                              to Give      pp. 561-562)

California Tests of Mental Maturity         Kgn.-1, 1-3, 4-8, 7-10,   short or long forms     90 or 45     (illegible)
                                            9-adults
Pintner General Ability Tests               Kgn.-1, 3-4, 5-8, 9-12    2 forms, 3 levels       45           38
Kuhlmann-Anderson Intelligence Tests        1B, 1A, 2, 3, 4, 5, 6,    1 form, 9 levels        40-60        14
                                            7-8, 9-12
Detroit First-Grade Intelligence Tests      1-2                       1 form, 2 editions      30-35        28
Otis Quick-Scoring Mental Ability Tests     1-4, 4-9, 9-16            2 forms at each level   25, 30       38
Terman-McNemar Test of Mental Ability       7-12                      (illegible)             40           38
Henmon-Nelson Tests of Mental Ability       3-8, 7-12, 13-16          2 at each level         35           19
Ohio State University Psychological Test    9-16 and adults           new form each year      120-180      25
American Council on Education               9-12, 13                  new form each year;     60           2
  Psychological Examinations for High                                 two levels
  School Students and College Freshmen




be comprehended under the two major language systems of our 
culture: the verbal and the mathematical. That is, thinking and 
intellectual processes in present-day society are carried on mainly 
in terms of words or mathematical symbols. Since this is the case,
it is obvious that measures of the general achievement of pupils in 
these areas should be able to serve many of the same purposes as 
mental ability tests do. 

The Stanford Achievement Tests, for example, provide for grade 
school pupils a battery of tests which approach in terms of school 
subject matter many of the same verbal and mathematical abilities
at which general intelligence tests aim. The primary battery, for 
Grades 2 and 3, includes tests in paragraph meaning, word mean- 
ing, spelling, arithmetic reasoning, and arithmetic computation. The 
intermediate battery, for Grades 4 to 6, and the advanced battery, 
for Grades 7 to 9, include tests in paragraph meaning, word mean- 
ing, language usage, spelling, literature, arithmetic reasoning, arith- 
metic computation, social studies, and elementary science. 

The highly refined procedures used in the selection of the test 
materials for curricular and statistical validity, in the development 
of methods of administration, and in the determination of norms 
and interpretative devices all render this battery a distinctly valuable 
aid in the evaluation of grade school pupils. The comparability of 
the scores from subject to subject and from form to form enables the
determination of pupils’ strengths and weaknesses as between dif- 
ferent subjects and the charting of an individual pupil's progress 
in each subject from year to year. 

At the secondary school level a similar set of achievement tests
whose scores are comparable from subject to subject is published
by the Cooperative Test Service. These will, however, be considered
separately in the following discussion of tests of specific verbal and
mathematical abilities.

Reading.—The verbal factors of mental ability take the forms of 
reading, composition, mechanics and grammar, spelling, and handwriting.
The evaluation of handwriting has already been dealt with
in Chapter XI. Let us now turn to each of the other aspects of verbal
ability, briefly examining its nature and importance and the techniques
by which it may be evaluated.

The importance of reading follows from its indispensability in our
society as a tool for the acquisition of ideas. This basic tool receives
major emphasis throughout the years of primary and elementary 
schooling. Perhaps more than any other skill, the ability to read 
comprehendingly and speedily determines not only the pupil's suc- 
cess with other school subject matter but also his fitness for the world 
of work and the responsibilities of citizenship. Consequently there 
has emerged outside the field of general mental ability testing, 
which is itself implicitly concerned to a large extent with reading 
ability, a large body of literature and techniques for the evaluation 
of reading ability. 

It is perhaps more proper to speak of reading abilities than of 
reading ability because the skills of pupils vary with the nature of 
the material read and the purposes for which it is read. Pupils and 
adults may read to keep informed concerning current events, to 
follow directions, to secure pleasure during leisure hours, and, of 
course, to learn school subject matter. Wide individual differences 
in reading ability exist not only at the elementary school level but 
also at the high school and college levels. Many of the learning difficulties
of students in high school and college have been traced to a
lack of ability to read with adequate understanding and speed.

Three levels of reading comprehension have been identified
(7 : 897): (1) “Understanding of words, the rate at which words
and phrases are recognized, the rhythmical progress of perceptions
along the line,” (2) “the fusion of the specific meanings represented
into a chain of related ideas, . . .” and (3) ability to “reflect on the
significance of the ideas presented, evaluate them critically, and
make application of them in the solution of problems.”

Another aspect of reading that is of special interest to the primary 
grade teacher is reading readiness. Not all the pupils who enter the 
first grade are equally able to profit from instruction in reading. 
Reading readiness is the pupil's development along those lines 
which are related to his success in learning to read. Evaluations of 
it are useful in predicting success in learning to read and they enable
the postponement of reading instruction for those pupils who are
not yet ready for it. Such evaluations should ideally be comprehensive,
taking into account the pupil's mental age, interest in
reading, visual and auditory sensory capacities, and emotional and 
social adjustments. Most of these factors, however, are disregarded
by available reading readiness tests, since the factor of administrability
restricts them to paper and pencil situations. These tests
differ from intelligence tests in that, by seeking to correlate highly 
with only one area of scholastic success, their materials can be selected
so as to exclude factors more obviously related to other areas, such 
as arithmetic, than to reading. The distinction is not very marked, 
for the available evidence indicates that their superiority in predictive 
power over intelligence tests is slight (7 : 911). In conjunction with
intelligence tests and judgments of other kinds of pupil maturity,
reading readiness tests do enable the first-grade teacher to discriminate
between pupils who are and those who are not ready
to receive reading instruction.

Table 8 presents data concerning a representative group of available
reading tests. The Lee-Clark Reading Readiness Test consists
of four subtests, two using capital letters and two using lower-case
letters. The pupil is required either to draw lines connecting the
letters that are alike or to cross out the letter that does not belong
with three others.

The Metropolitan Reading Readiness Test should be given, ac- 
cording to the authors, to small groups of children, preferably 
fewer than ten. The six subtests involve recognition of similarities 
in pictorial material and in symbols, the copying of figures, the 
selection of pictures that illustrate the word the examiner names, the 
comprehension of phrases and sentences rather than individual 
words, the exercise of number knowledge, and the exercise of 
common knowledge in selecting from a row of four pictures the one 
that satisfies the examiner’s description. 

The Van Wagenen Reading Readiness Test measures, in six 
subtests, the pupil's range of general information, ability to detect 
and use relationships, oral vocabulary, memory span for ideas, 
ability to discriminate between words, and number of repetitions 
required in learning the names of printed words. It is, however, an 
individual test and so is less practical in the testing of large class 
groups. 

The Betts Ready-to-Read Tests are an elaborate individual 
technique for determining physiological reading readiness, visual 
perception, auditory and articulation ability, fitness of the visual
mechanism, and lateral dominance or sidedness of the hand, foot,
and eye. The measurement of oculomotor and perception habits involves
the use of the Betts-Keystone Telebinocular mentioned in
Chapter XIII.


TABLE 8. VERBAL ABILITY TESTS

Name of Test                                  Grades        No. of        Time (min.)     Publisher No.
                                              Designed for  Forms         Required to     (see list on
                                                                          Give            pp. 561-562)

Lee-Clark Reading Readiness Tests             Kgn.-1        1             15              7
Metropolitan Readiness Tests                  Kgn.-1        1             70              38
Van Wagenen Reading Readiness Test            Kgn.-1        1             30 (indiv.)     14
Betts Ready-to-Read Tests                     1-17          1             30 (indiv.)     (illegible)
Stanford Achievement Test                     2-3           5             78              38
                                              4-6                         170
                                              7-9                         170
Gray Standardized Oral Reading Check Tests    1-8           5             10-15 (indiv.)  28
Iowa Silent Reading Tests: Elementary and     4-9           2             65              38
  Advanced                                    9-13                        60
Iowa Every-Pupil Tests of Basic Skills        3-5           3             60              19
                                              5-9           3             70
Davis-Schrammel Elementary English Test       4-8           2             25              4
Pressey English Tests for Grades 5 to 8       5-8           3             40              28
Progressive Language Tests                    1-3           3             40              7
                                              4-6
                                              7-9
                                              9-13
Purdue Reading Test                           7-16          (illegible)   40              21
Purdue Placement Test in English              9-16          3             45              19
Cooperative Effectiveness of Expression Test  7-12          3             40              (illegible)
                                              11-16
Cooperative Mechanics of Expression Test      7-16          3             40              (illegible)
Cooperative Reading Comprehension Test        7-12          3             40              (illegible)
                                              11-16
Rinsland-Beck Natural Test of English Usage   9-13          2                             28
  Test I, Mechanics                                                       50
  Test II, Grammar                                                        35
  Test III, Rhetoric                                                      50

In the higher primary grades and the elementary grades, reading 
ability is measured in terms of both oral and silent reading. The 
Gray Standardized Oral Reading Check Test contains, for each 
grade level, five tests of about equal difficulty in the forms of three 
paragraphs of fifty words each. The pupil, tested individually, 
reads these aloud; the examiner records the time required to read
the selection and records errors by means of a code similar to a
proofreader's.

The Iowa Silent Reading Test: Elementary Test is widely used
to test silent reading ability. It measures (1) rate of reading at a
controlled level of comprehension, (2) comprehension of words,
sentences, paragraphs, and longer articles, and (3) ability to use
skills required in locating information, i.e., the use of an index and
the ability to select key words. The Advanced Test of this series
provides similar measures at the high school and college levels.
The first subtest, rate comprehension, involves the reading of
paragraphs, marking the word reached at the end of one minute to
secure a measure of rate of reading, and answering multiple-choice
questions based on the paragraphs. The second subtest, directed
reading, requires the pupil to indicate the lines in a given selection
in which the answers to various questions are to be found. The
third subtest, general vocabulary and subject-matter vocabulary,
consists of words whose synonyms are to be selected from among
four choices. The fourth subtest, paragraph comprehension,
requires the selection of the best title for the paragraph and
answering three-choice questions based upon it. Subtest five, sentence
meaning, requires the pupil to answer yes or no to a list of 27
questions whose truth or falsity is obvious when the question is
comprehended. Subtest six, alphabetizing, using guide words, and using
an index, requires the pupil to locate words at given intervals in an
alphabetical list, and to answer questions requiring the location of
information in an index sample that is furnished.

The Purdue Reading Test for use from the seventh grade through
college contains a series of reading selections drawn from a wide
variety of materials of the type students are ordinarily required to
read. A variety of recall and recognition items are used in testing
comprehension of the selections.

English Usage.—English usage is another form of language ability 
for which many tests are available. They usually involve ability to
punctuate, grammatical classification, recognition of grammatical 
errors, evaluation of sentence structure, paragraph reading rate and 
comprehension, vocabulary, and spelling. A wide variety of in- 
genious constant-alternative and changing-alternative test items are 
used for these purposes. 

The Iowa Every-Pupil Tests of Basic Skills provide measures in 
this field for the elementary grades, as do also the Davis-Schrammel 
Elementary English Test, the Pressey English Tests for Grades 5
to 8, and the Progressive Language Tests. At the secondary and 
college levels representative tests are the Purdue Placement Test 
in English, the Cooperative Effectiveness of Expression Test, the 
Cooperative Mechanics of Expression Test, and the Rinsland-Beck 
Natural Test of English Usage. Test III of the latter, unlike most 
others, requires the pupil to write some actual sentences; scorability 
is thereby sacrificed to some extent, but, according to the test 
authors, validity is increased. 

Mathematical Abilities.—Mathematical abilities constitute roughly
the second broad grouping of mental abilities which modern cor- 
relational studies have isolated. Similarly, they represent the second 
major class of symbolic material upon which intelligence may be 
considered to operate. Mathematics is essentially a language of 
quantitative terms and relationships. Its importance at all levels of 
complexity is so obvious in our civilization as to require little com- 
ment here. Table 9 presents data concerning representative mathe- 
matics tests. 

At the elementary level much attention has been paid to the
diagnostic function of arithmetic tests as well as to general
achievement measurements. Thus the Compass Diagnostic Tests in
Arithmetic provide a series of twenty tests, each covering in detailed
fashion the basic arithmetic processes taught in Grades 2 to 8. The
twenty tests measure more than ninety different aspects of
arithmetic. Since the range of the battery is so great and each of the
tests aims to be long enough to insure sufficient reliability for
individual diagnosis, each teacher must select from among them
according to her particular instructional or diagnostic needs. Typical
situations in which these tests are recommended are as follows: after
completion of an important instructional unit, for certain or all of
the pupils of a new entering class, for individual pupils who fail
to make normal progress in learning a new skill. The Schorling-
Clark-Potter Arithmetic Test measures computational skill not only
with whole numbers but also with common fractions, decimals,
percentages, and denominate numbers. The Reavis-Breslich Diagnostic
Tests deal with both fundamental operations in arithmetic and with
problem solving at the high school level.


TABLE 9.—MATHEMATICAL ABILITY TESTS

Name of Test                  Grades     No. of   Time (min.)              Publisher No.
                              Designed   Forms    Required to              (see list on
                              for        at Each  Give                     pp. 561-562)
                                         Level

Compass Diagnostic Tests in   2-8        1        25-60 according to part  31
  Arithmetic, 20 parts
Reavis-Breslich Diagnostic    7-12       1        35
  Tests in the Fundamental
  Operations of Arithmetic
  and in Problem Solving
Schorling-Clark-Potter        5-12       2        45                       38
  Arithmetic Test
Orleans Algebra Prognosis     7-9        1        90                       38
  Test
Iowa Algebra Aptitude Test    9          1        45                       5
Cooperative Algebra Test      gx         1        45                       11
Breslich Algebra Survey Test  9a         2        50                       28
                              9b         2        60
Columbia Research Bureau      9-14       2        110                      38
  Algebra Test
Cooperative Intermediate      g2         7        45 or 90                 11
  Algebra Test
Orleans Geometry Prognosis    9-1        1        80                       38
  Test
Iowa Plane Geometry Aptitude  9-12       1        50                       5
  Test
Iowa Placement Examinations,  i          2        40                       5
  Mathematics Training

On entering high school, the pupil’s probable success in algebra
is frequently of prime significance in curriculum selection. Several
algebra aptitude tests have been published, some of which measure 
ability to learn new material similar to algebra and respond to 
objective questions on it, while others measure aptitude in terms 
of previously acquired skills in arithmetic which are considered to 
be related to success in algebra. The Orleans Algebra Prognosis Test 
is of the former type, containing eleven simple lessons in algebra 
and tests on the lessons. The Iowa Algebra Aptitude Test contains 
subtests in arithmetic, abstract computation, numerical series, and 
dependence and variation. Representative tests for measuring algebra 
achievement are the Cooperative Algebra Test (elementary algebra 
through quadratics), the Breslich Algebra Survey Test, and the
Columbia Research Bureau Algebra Test. The Cooperative Inter- 
mediate Algebra Test is available for evaluating achievement in 
the higher levels of high school algebra. 

Geometry aptitude has been approached by the same two tech- 
niques as those used for algebra aptitude. The Orleans Geometry 
Prognosis Test contains simple lessons with tests in geometry, the 
material differing from that taught in pre-geometry subjects. The 
Iowa Plane Geometry Aptitude Test makes its approach through
achievement in algebraic computation, algebraic and arithmetic 
reasoning, reading geometry content, and visualization. 

It is not essential, of course, that a test used as a predictor of 
success be labeled "aptitude" or "prognostic." The Iowa Placement 
Examinations, Mathematics Training, has been found to have as 
much or more ability to predict success in college mathematics as 
the test in the same series labeled Mathematics Aptitude. The Iowa 
Mathematics Training Test, consisting of 75 short problems, assumes 
that a student has studied mathematics in high school for one year 
and yields a measurement of his preparation for continued study of 
this subject at the college level. 


SOURCES OF INFORMATION CONCERNING TESTS


The preceding paragraphs have mentioned only a small proportion
of all the tests published in these fields. Only the general nature
and content of mental ability measurement and various important
practical considerations have been presented. The reasons for
limiting the discussion in this fashion are twofold: first, limitations of
space prohibit any attempt to provide detailed specific information,
evaluations, and recommendations concerning the large numbers
of tests available in each of the fields discussed; second, the
availability of books which can satisfy far more thoroughly than is
possible here the need for an exhaustive listing and evaluation of
available tests makes such an attempt superfluous for a general
textbook-manual in educational evaluation and measurement. We refer here
specifically to Hildreth’s Bibliography of Mental Tests and Rating
Scales (8) and to Buros’ Mental Measurements Yearbooks (4). Not
only are they specifically designed for the functions which this book
avoids, but they also have the advantage of being periodic. At the
present writing, two editions of both of these works have already
been published (Hildreth’s Bibliography in 1933 and 1939; Buros’
Yearbooks for 1938 and 1940) and further editions will probably
appear as needed. In this way these volumes can keep abreast of
current developments in published standardized evaluation devices.

A further source of information concerning the on-going development
of evaluation devices is the Review of Educational Research
(1). The issues entitled “Psychological Tests and Their Uses," 
“Educational Tests and Their Uses,” and “Methods of Research and 
Appraisal in Education” are especially rich in references to new 
tests and to research findings with old tests. These appear in three- 
year cycles and summarize very extensively the literature of the 
preceding three years in a given field. 

It is, of course, desirable for every classroom teacher and school
administrator to have access to these works when selecting
evaluation devices. When this is impossible, teachers and administrators
may still secure these advantages by writing to the education research
bureaus of universities for specific information concerning
available tests in a particular field. Such inquiries can be answered most
helpfully by such a bureau when they concern a specifically
mentioned test or else state specifically the purposes, time limits, financial
limits, and professional equipment limits of the proposed testing
program. By furnishing this information, teachers can secure better
answers to their questions and at the same time save university
research bureaus much perplexity and frustration. In other words,
intelligent questions should be asked if intelligent answers are
desired.




SUMMARY 


Teachers are more directly concerned with the selection than the
construction of intelligence tests. The prediction of ability to succeed
in school is probably the most usual purpose of intelligence tests.
These tests are constructed according to such criteria as a conception
of intelligence, equality of opportunity, increasing success as age
increases, correlation with total test score, and administrability. Group
tests are more useful than individual tests in the majority of
situations; similarly, verbal tests are needed more often than performance
or non-language tests. Among the types of material often used in
intelligence tests are common observation, vocabulary, information,
disarranged sentences, verbal analogies, spatial analogies, and
number series. Verbal ability tests are concerned with reading and English
usage. Mathematical ability tests include achievement in arithmetic,
algebra, and geometry. Useful sources of information concerning
specific tests of intelligence as well as other aspects of pupils are
Hildreth's Bibliography, Buros' Yearbooks, and certain numbers
of the Review of Educational Research.


QUESTIONS 


1. Examine your own school experience for instances where an
intelligence test could best have furnished certain information desirable
for pupil guidance.

2. Group the seven types of intelligence test material illustrated in this
chapter into classes according to equality of familiarity or
unfamiliarity, to level of thought process required, and to a verbal,
mathematical, and spatial grouping.

3. With which of the types of test material in this chapter would you
expect boys to have more success than girls? Could various intelligence
tests be constructed that would show girls to be more intelligent
than boys, and vice versa? Explain.

4. Show how the various group differences in intelligence which have
been revealed through intelligence testing depend on the type of
test material included in the intelligence test. Consider, for example,
how rural children could achieve higher intelligence test scores than
urban children through the proper selection of test content.

5. In what sense were Binet’s and subsequent intelligence tests
inadequate measures of ability to succeed in school? What other aspects
of pupils would you evaluate to make an estimate of this ability more
comprehensive? Why, for example, do some pupils with high
intelligence test scores fail in school or in some school subjects?

6. Can you rank the various types of intelligence test material,
including performance and non-language tests, in order of their relevance
to successful living in our civilization? Why is verbal material so
predominant in most intelligence tests?


REFERENCES 


1. American Educational Research Association, Review of Educational
Research, Washington: American Educational Research Association.

2. Bingham, W. V., Aptitudes and Aptitude Testing, New York:
Harper & Brothers, 1937.

3. Bronner, A. F., Healy, W., Lowe, G. M., and Shimberg, M. E.,
A Manual of Individual Mental Tests and Testing, Boston: Little,
Brown & Company, 1928.

4. Buros, O. K. (ed.), Mental Measurements Yearbooks, Highland
Park, N. J.: The Mental Measurements Yearbook, 1938, 1940.

5. Freeman, F. N., Mental Tests, Boston: Houghton Mifflin Company,
revised edition, 1939.

6. Garrett, H. E., and Schneck, M. R., Psychological Tests, Methods,
and Results, Part II, Chapters 1, 2, New York: Harper & Brothers,
1933.

7. Gray, W. S., article on “Reading,” Encyclopedia of Educational
Research, New York: The Macmillan Company, 1941.

8. Hildreth, G. H., A Bibliography of Mental Tests and Rating Scales,
New York: The Psychological Corp., 1939.

9. Kelley, T. L., Interpretation of Educational Measurements, Yonkers:
World Book Company, 1927.

10. Ruch, G. M., and Segel, D., Minimum Essentials of the Individual
Inventory in Guidance, U.S. Dept. of the Interior, Office of
Education, Vocational Division Bulletin No. 202.

11. Terman, L. M., The Measurement of Intelligence, Boston: Houghton
Mifflin Company, 1916.

12. Terman, L. M., and Merrill, M. A., Measuring Intelligence, Boston:
Houghton Mifflin Company, 1937.

13. Thurstone, L. L., Primary Mental Abilities, Psychometric
Monographs No. 1, Chicago: University of Chicago Press, 1938.

14. Toops, H. A., “The evolution of the Ohio State University
Psychological Test,” Ohio College Association Bulletin No. 113, 1939.


CHAPTER XV 


Special Abilities 


IN THIS CHAPTER WE SHALL CONCERN OURSELVES WITH METHODS OF 
evaluating special abilities for the purposes of guidance. First we 
shall examine the theoretical bases for evaluation of this kind. 
Then the present status and implications of work on these bases in 
factor analysis will be discussed. Third, the implications of the 
nature of American vocations and of the present status of job in- 
formation for special ability evaluation will be set forth. Fourth, the 
general functions and usefulness of available devices for the evalua- 
tion of special abilities will be presented. Finally, the fields in which 
special ability tests are available will be discussed, and specific tests 
will be mentioned.


THEORETICAL BASES OF SPECIAL ABILITY EVALUATION


In Chapter IV attention was called to the problem of the
organization of mental abilities. Three types of theory—the “sand,” “gravel,”
and “cobblestone”—were mentioned. Whichever of these is closest
to the truth, the “gravel” theory—that is, the theory of group and
specific factors—is the only one that implies the possibility of a
program of vocational guidance based on psychological testing. This
has been pointed out by Hull (3 : 201-205). The “sand,” or specific
factor, theory “pressed to its logical limit . . . would hold that the
factors determining success in all activities—both aptitude behavior
and test behavior—can be found in no other activity whatever; i.e.,
that they must be absolutely unique. Thus a test activity could never
contain any of the determiners found in an aptitude activity. There
could never be a correlation between the two.” Similarly, the
“cobblestone,” or general factor, theory “would preclude the possibility of
forecasting by means of ordinary tests whether a person would be
more effective in one aptitude than another. It would be impossible
to differentiate the aptitudes within an individual. On this theory,
vocational guidance based on psychological testing would be
impossible.” On the other hand, the “gravel,” or group factor, theory
in its strict form, assuming no universal factor or determiner
running through all possible intellectual activity and rejecting
specific or unique factors, would enable a test to correlate with one
aptitude while not correlating at all with another. “In other words,
the existence of group factors would permit the possibility of
differentiating the potential aptitudes of an individual by means of
tests.” It is pertinent to remind the reader once more that by
aptitude we mean measured ability predictive of ability capable of
being developed in the future.
We see thus that the hope of developing aptitude tests to predict
vocational success is based upon the degree to which the group factor
theory is borne out. That is, aptitude testing requires tests which
enable the determination of differences in the amounts of certain
aptitudes possessed by an individual. The existence of such tests
provides a decisive indication of the existence of group factors.
These tests have already been produced in considerable variety.
Thus, Hull and Limp (3 : 42-43) produced test batteries for
shorthand, typewriting, high school English, and high school algebra
which enabled the differentiation of the four aptitudes from each
other. Many other tests of special abilities have been constructed
which succeed in predicting specific achievement rather than general
achievement.

It is necessary to make a distinction between special abilities and
all other aspects of pupils which are predictive of vocational
success. In one sense, every pupil aspect treated in this book may be
considered an aptitude in so far as it is related to vocational success.
Especially is this true of the interests of pupils, for many studies
have been made to show that pupils differ widely in the degree to
which their interests, or emotionally toned tastes, likes, and dislikes,
fit them for various vocations. Similarly, the emotional and social
adjustments of pupils, their “personalities,” serve to distinguish them
from one another in fitness for vocational success in various fields.
Moreover, the pupil's socio-economic environment and background,
his family, and the community in which he lives may be considered
as determiners of his fitness or aptitude for one vocation as against
others. Thus we see that every aspect of pupils discussed in this
book may be considered a part of his vocational aptitude, broadly
conceived. In this chapter, however, we are concerned only with
special abilities as indicators of vocational fitness. That is, not all
aspects which are related to future success are considered, but only
those which are abilities. This distinction between abilities and
other aspects is perhaps made clearer in terms of three definitions by
Thurstone (8 : 48); he defines a trait as any attribute of an
individual, an ability as a trait defined by what an individual can do,
and a test as the task, together with the method of appraising it,
which defines an ability.

An ability may be an achievement or development of the in- 
dividual in the same sense as we consider the pupil’s mastery of 
school subject matter to be an achievement. The abilities with which 
we are here concerned, however, may be distinguished from school 
achievements in that they are relatively not subject to change in the 
future, are relatively stable, and have reached, through previous 
maturation and experience, about as high a level as is possible for 
the given individual. At the time at which they are considered to 
be predictive of vocational success these special abilities are no longer 
readily acquired. Previous growth has presumably brought them to 
somewhere near their peak. Examples of these abilities, which will 
be taken up later in more detail, are ability to discriminate the pitch 
of musical tones, the abilities tapped by the Minnesota Clerical Test, 
visual acuity, rate of tapping, and so forth.

It is well known both to psychologists and to laymen that not all
individuals possess these special abilities in the same degree. This
is, of course, the implication of the group factor theory of mental
organization. We cannot successfully characterize an individual's
abilities or potentialities for success in a great number of varied tasks
in terms of a single general ability score. Two individuals with the
same general ability score may differ from each other in the
composition of that general ability; one, for example, may have a high
verbal ability and low mathematical ability while the other has a
high mathematical ability and low verbal ability. Consequently, in
attempting to guide pupils among the various curricula and
vocations available, some breakdown of general ability into special abilities
is required.


PRESENT STATUS OF FACTOR ANALYSIS


In Chapter IV two methods of making this breakdown were 
mentioned. One is in terms of statistically derived “primary mental 
abilities” and the other is in terms of culturally determined fields of 
endeavor. The formulation in terms of primary abilities may be 
illustrated by the following factors which have been reported by 
various factor analysts (9 : 30-33): 

1. A verbal factor, involved primarily in those tests which depend 
upon the meanings of words and the ideas associated with them. 

2. A space factor, which appears in tasks requiring reactions to 
spatial relations, such as reading plans or blueprints or telling 
whether two drawings represent one or more sides of an asymmetri- 
cal figure. 

3. A number factor, requiring such simple numerical operations 
as multiplication, addition, subtraction, and division, but not the 
more complex tasks involving numerical reasoning. 

4. A memory factor, requiring paired associations or the
recognition of recently learned material.

5. A mental speed factor. 

6. A perceptual factor, or readiness to discover and identify per- 
ceptual details. 

7. Deduction and induction factors, measured, respectively, by 
syllogistic reasoning tests and by tests requiring the subject to find 
the rule which binds a number of items together and from it to 
classify or predict other items. 

It is evident from the description of primary mental abilities that 
none of these bears a direct and obvious relationship to any specific 
vocation, such as automobile mechanic or lawyer. It would be ex- 
pected, however, that some of them would be involved in some 
vocations more than in others. Thus the spatial factor is probably
involved to a high degree in the work of a draftsman, whereas 
lawyers and writers need more of the verbal factor. The general 
nature of the uses of factor analyses of primary mental abilities in 
vocational guidance may be indicated as follows: 




1. Determine the degree to which each ability is predictive of
success in each occupation. This would be in the form of a
correlation coefficient, from which a weighting of the ability could be
derived.

2. Determine the score in each ability of the individual being 
counseled. 

3. Multiply the ability scores by the weights and add the products.
The resulting sum would be the individual's predicted success in
that occupation.
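
The three steps above amount to a weighted sum of ability scores. The following is only a minimal sketch in Python: the ability names, the weights, the pupil's scores, and the function name `predicted_success` are all hypothetical, invented for illustration, since no such weights have actually been determined for any occupation.

```python
# Hypothetical illustration of the three-step procedure. Every number
# below is invented for the example; none comes from a published study.

def predicted_success(weights, scores):
    """Step 3: multiply each ability score by its weight and add the products."""
    return sum(weights[ability] * scores[ability] for ability in weights)

# Step 1 (assumed): weights derived from the correlation of each ability
# with success in one occupation, here labeled "draftsman".
draftsman_weights = {"verbal": 0.1, "space": 0.6, "number": 0.3}

# Step 2 (assumed): one pupil's standard scores in each ability.
pupil_scores = {"verbal": 0.5, "space": 1.5, "number": -0.5}

prediction = predicted_success(draftsman_weights, pupil_scores)
print(round(prediction, 2))
```

Repeating the calculation with a different set of weights for each occupation would give one predicted value per occupation for the same pupil, which is what would allow occupations to be compared.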

Vocational counseling would indeed be a mechanical, easily 
automatic, yet highly accurate procedure if the tools necessary for the 
above procedure were available. Such an advanced stage has obvi- 
ously not yet been reached, for the necessary primary abilities have 
not yet been isolated, tests to measure the extent to which in- 
dividuals possess these abilities are not yet available, and the weights 
of each ability in each occupation have not been determined. Per- 
haps after considerable progress has been made on the basis of much 
additional research along these lines, vocational guidance will be- 
come a matter of easy routine. For the present it will be necessary 
to predict vocational success on the basis not of a test battery 
common to all vocations but of test batteries designed specifically 
for single occupations or groups of occupations. That is, many dif- 
ferent batteries of ability tests predictive of vocational achievement 
are necessary, rather than one universally applicable set of tests for 
primary abilities.


THE NATURE OF AMERICAN VOCATIONS


It would be impossible, however, to construct a test for the predic- 
tion of success in each of the 23,000 to 25,000 distinct jobs which 
have been defined by the United States Employment Service. The 
usability of such a vast battery of tests, even if one could be con- 
structed, would also be small for guidance purposes because the 
problem of which test to give to which pupil would be almost 
insurmountable. As was pointed out in Chapter IV, information con- 
cerning the thousands of distinct jobs or occupations in which the 
millions of workers in this country are employed is available in the 
publications of the U. S. Employment Service. The Dictionary of 
Occupational Titles, Part I, has already been mentioned. This
Dictionary is based upon data collected by trained job analysts who
at first hand observed workers at work, their duties, the machines 
and tools used, the hiring requirements of employers, and the esti- 
mated abilities necessary to make an adjustment in each job. It is 
estimated that about 80 per cent of all occupations are covered in the 
present edition of the Dictionary. 

The problem of simplifying and clarifying the consideration of 
23,000 separate occupations for vocational guidance purposes be- 
comes immediately apparent once the tremendous variety of Ameri- 
can economic endeavor is realized. It is necessary, as Shartle 
(6 : 45-51) points out, to group occupations into families based on 
different lines of relationship, if the meanings and usefulness of 
these vast accumulations of job descriptions are to be grasped. One 
such classification has been attempted by the Census Bureau, but its 
organization on an industrial basis lessens its usefulness for guidance 
purposes. The U. S. Employment Service classification is based on 
the similarity of work performed and materials used, and is as 
follows:


MAJOR OCCUPATIONAL GROUPS AND DIVISIONS


0—Professional and managerial occupations.
    0-0 through 0-3   Professional occupations.
    0-4 through 0-6   Semi-professional occupations.
    0-7 through 0-9   Managerial and official occupations.
1—Clerical and sales occupations.
    1-0 through 1-4   Clerical and kindred occupations.
    1-5 through 1-9   Sales and kindred occupations.
2—Service occupations.
    2-0               Domestic service occupations.
    2-2 through 2-5   Personal service occupations.
    2-6               Protective service occupations.
    2-8 and 2-9       Building service workers and porters.
3—Agricultural, fishery, forestry, and kindred occupations.
    3-0 through 3-4   Agricultural, horticultural, and kindred occupations.
    3-8               Fishery occupations.
    3-9               Forestry (except logging), and hunting and trapping
                      occupations.
4 and 5—Skilled occupations.
6 and 7—Semi-skilled occupations.
8 and 9—Unskilled occupations.




In Part II of the Dictionary of Occupational Titles, Titles and 
Codes, each of the thousands of jobs defined in Part I is classified 
along with other jobs to which it is related. Thus, in Part II can be
found lists of occupations grouped according to the work performed 
and the materials used; the definitions of these jobs are easily found 
in Part I of the Dictionary. Similarly, if the definition of a job title 
is first found in Part I, jobs which are related to it may be found by 
looking up its code number in Part II and noting the occupational 
titles that are adjacent to it there. The definitions in Part I are also 
classified by industry, all occupations peculiar to a given industry 
being listed together; those occupations that are found in industries 
generally are excluded. 

Other occupational classifications are being worked out by the 
U. S. Employment Service with a view to serving the need of
adequate programs of vocational guidance. Among those mentioned
by Shartle (6 : 48) are occupational classifications based on: 

1. The personal worker qualifications required for success 

2. The aptitude needed to learn the job 

3. Educational prerequisites necessary for entrance 
On the basis of such classifications, it is hoped that answers will 
eventually be given to such questions as these (6 :48): “What are 
the occupations that might be open to a person whose chief ability 
and interest is in dealing directly with people? What are the occupa- 
tions, regardless of industry, that might be entered by a youth 
Who has the aptitude and interest for operating electrical machines, 
or the youth who has unusual finger dexterity, or unusual eye-hand 
coordination? What are the eighth-grade jobs, the high school jobs, 
the college jobs?” If we knew the world of work in this way and 
also the pupil, it would be possible to point out for him a family of 
similar jobs which best fit his traits, For example, knowing that a 
pupil has a high school education, poor vision, an excellent sense 
of taste, high finger dexterity, poor memory for details, excellent 
memory for abstract ideas, low ability in arithmetical computation, 
a high degree of interest in mechanical devices, ability to estimate 
the size and form of objects, and many other similar traits and 
characteristics, we may eventually be able to provide him with a list 
of occupations which require his particular set of traits. For each 
job studied by the U. S. Employment Service, the job and worker
characteristics are punched along the four edges of a card. By using
a hand sorting needle, it is possible to select occupations according
to various combinations of about 80 of these job and worker
characteristics (7 : 181). Families of occupations can be distinguished 
by inspecting the punched edges of a large pack of these master 
occupational cards and selecting those that have the largest number 
of similar punched edges. 
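The needle-sorting procedure just described can be pictured as operations on sets. What follows is a minimal sketch in Python, not the Employment Service's actual procedure: the occupations, the characteristics assigned to them, and the function names `needle_sort`, `shared_punches`, and `closest_pairs` are all hypothetical and are chosen only for illustration.

```python
from itertools import combinations

# Hypothetical master cards: each occupation maps to the set of
# characteristics punched on the edges of its card (illustrative data only).
MASTER_CARDS = {
    "watch repairer": {"finger dexterity", "eye-hand coordination", "near vision"},
    "instrument assembler": {"finger dexterity", "eye-hand coordination", "form perception"},
    "salesperson": {"meets the public", "personal appearance", "oral expression"},
    "receptionist": {"meets the public", "personal appearance", "memory for details"},
}


def needle_sort(cards, required):
    """Return the occupations whose cards are punched for every required trait."""
    required = set(required)
    return sorted(job for job, punches in cards.items() if required <= punches)


def shared_punches(cards, job_a, job_b):
    """Count the characteristics punched on both cards."""
    return len(cards[job_a] & cards[job_b])


def closest_pairs(cards):
    """List occupation pairs, those sharing the most punches first."""
    pairs = combinations(sorted(cards), 2)
    return sorted(pairs, key=lambda pair: -shared_punches(cards, *pair))


if __name__ == "__main__":
    print(needle_sort(MASTER_CARDS, ["finger dexterity", "eye-hand coordination"]))
    for job_a, job_b in closest_pairs(MASTER_CARDS)[:2]:
        print(job_a, "/", job_b, shared_punches(MASTER_CARDS, job_a, job_b))
```

Selecting on a combination of characteristics corresponds to `needle_sort`; grouping cards with many punches in common, as `closest_pairs` does, is one simple way to suggest candidate families of occupations.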

Preliminary results of this program of determining occupational 
relationships have been reported by Stead, Shartle, and associates 
(7 : 188-205), whose work promises to be of great usefulness in 
vocational guidance. While their efforts are primarily directed 
toward the problems encountered in public employment services, it 
is expected that vocational training and guidance workers and 
classroom teachers will also find much use for their material. The 
worker characteristics form shown in Figure 7 is filled out by trained 
job analysts for each job. The analyst indicates, by checking Column
A, B, or C, “what amount of each characteristic is demanded of the 
worker in order to do the job satisfactorily.” The amounts desig- 
nated by these letters are: 

A: A very great amount of the trait such as would be possessed 
by not more than 2 per cent of the general population. 

B: A distinctly above-average amount of the trait, less than that 
designated by A but more than that designated by C. 

C: An amount of the trait less than that possessed by the highest 
30 per cent of the general population. 

It is recognized that this method of estimating rather than meas- 
uring worker characteristics required for success on a job is not of 
optimum exactness. It has the major advantage, however, of being 
far speedier in yielding results for a large number of occupations. 
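The three amounts can be restated as cutoffs on a percentile scale. The sketch below is only an interpretation of the definitions above: analysts estimate these amounts rather than measure them, and the function name `amount_rating` and the treatment of the exact boundary values are assumptions made for illustration.

```python
def amount_rating(percentile):
    """Map a percentile rank (0-100) in the general population to Column A, B, or C.

    Assumed reading of the definitions:
      A = top 2 per cent, B = remainder of the top 30 per cent, C = all others.
    """
    if not 0 <= percentile <= 100:
        raise ValueError("percentile must lie between 0 and 100")
    if percentile >= 98:   # possessed by not more than 2 per cent
        return "A"
    if percentile >= 70:   # within the highest 30 per cent, but below A
        return "B"
    return "C"             # less than the highest 30 per cent possess


if __name__ == "__main__":
    for p in (99, 85, 50):
        print(p, amount_rating(p))
```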

The nature and promise of this work in occupational classification 
has been summarized by Ward (7 : 206):


Eventually it may be possible to set up tables of occupations in which 
certain qualifications are of primary importance, such as personal 
appearance, ability to meet the public, strength, dexterities, coordina- 
tions, and amount of schooling. For the counselor, occupations may be 
identified in a code book by a family number for easy reference to the 
various groupings. . . .

Regardless of the method used in revealing occupational families and
their relationships, it seems assured that counselors must be provided
with such data if the occupational structure of industry is to be
adequately understood.

[Fig. 7. Scale of worker characteristics: the Worker Characteristics
Form of the U. S. Department of Labor, United States Employment
Service, on which the amount of each characteristic demanded of a
worker is indicated by a mark in Column A, B, or C.]


Needless to say, this high development of the whole aspect of 
vocational guidance classifiable under the heading of occupational 
information will have to be adapted to particular community situa- 
tions and needs. National publications issued by the U. S. Employ- 
ment Service, such as the Dictionary of Occupational Titles and, 
eventually, sets of master pattern cards punched with the worker 
characteristics required by every kind of work in the country will 
have to be supplemented by job information and analyses based on 
the individual community whose youth is to benefit from a par- 
ticular vocational guidance program. This work of gathering and 
classifying local occupational information will probably be best
carried out when organized as a cooperative community project that
benefits from the efforts of the schools, the public state employment
service, neighboring universities, and various other public
service agencies. Methods of developing local community programs
for the promotion of occupational adjustment are described in
Bell (1).

In preceding paragraphs we have outlined the most advanced of 
the present-day movements to categorize the field of vocational 
guidance. On the side of the worker or pupil, an attempt has been 
made, by means of factorial analyses of tests of mental abilities, to 
isolate a relatively small number of primary traits on the basis of 
which predictions of vocational success can be made. On the side
of the work to be done, the occupations or jobs, we have noted the
attempts by the U. S. Employment Service to develop families of
jobs, or a relatively small number of occupational groups, in place
of long lists of the thousands of currently identifiable jobs. The
attempt to classify occupations has, however, proceeded in terms of a
list of about fifty worker characteristics, such as those shown in
Fig. 7. It is thus seen that this Service has in its own way attempted
something similar to that at which the factor analysts have been
aiming. A major difference, of course, is that while the latter have
been concerned primarily with the nature of mental abilities, the
Employment Service has taken account of all kinds of characteristics
which may be related to vocational success.


The difference between the two approaches has probably resulted 
from a difference in original interest. The theory of mental structure 
and the nature of mental ability have interested the factor analysts 
for their own sake. Consequently they have attempted to isolate 
primary traits, pure “functional unities,” without much prior regard 
for the usefulness in counseling of the material in the original test 
battery which was analyzed. The Employment Service has from 
the first been oriented toward the practical problems of occupa- 
tional counseling. 

Consequently the Service has begun with material which is of 
importance in counseling, without undue regard for whether each 
item or job-worker characteristic is primary, unique, or statistically 
independent of the rest. It seems apparent that a union of the two 
approaches is in order if the advantages of both are to be realized. 
The Employment Service approach insures usefulness and validity 
and the factor analysts provide efficiency by reducing the necessary 
number of characteristics of both jobs and workers to a minimum. 
T. L. Kelley (5 :8-9) has emphasized the need for prior concern 
with the usefulness of factor analyses as follows: “(a) Any set of 
mental factors proposed for general use should aid in the under- 
standing of distinctions which are important in the lives of people 
and in the processes of society. The test of merit of any set of
factors is the degree to which they (b) accurately, (c) fully, and
(d) economically facilitate this understanding of social living in its 
dual aspect of individual and social welfare.” 

It is evident that the type of guidance for which the factors of the 
present tests of primary mental ability would be useful is restricted 
to those secondary school students who are interested in making a 
choice among the various institutions of higher learning and the 
various professional and semi-professional occupations, at which 
level only a small proportion of all American youth will find jobs. 
This criticism of attempts to provide a foundation for vocational 
guidance in terms of primary mental abilities applies, however, only 
to the present form of the results of these attempts. It is quite 
probable that future efforts to reduce the description of pupils to a 
relatively small number of independent, unique traits will proceed 
in terms of such more varied characteristics as those listed in Fig. 7; 
that is, traits related to the great mass of American occupations
rather than only to those at the professional levels will have to be
taken into account.

The reader should realize that these attempts to reduce pupil and 
worker characteristics, on the one hand, and the vast number of 
different occupations, on the other, to a relatively small number of 
traits or occupational families find their justification in guidance 
rather than in selection. Selection is concerned with a given job and 
the choice of workers for it. The job is already singled out, its 
characteristics and requirements are well defined; hence the task of 
constructing tests to predict success in it can proceed along the 
following well-established lines: 

1. Develop a criterion of success in the activity or job for which 
workers are to be selected; for example, the production per hour of x 
units of quality g or better, with 5 per cent spoilage allowed. 

2. Analyze the job into the skills, activities, and other worker 
traits which are related to success in it.

3. Construct a test or set of test items which seem to elicit the
skills and activities determined in Step 2. 

4. Apply these tests to a group of persons who differ appreciably 
in success on the job and for whom criterion scores are available. 

5. Determine the validity of the test in terms of the correlation be- 
tween scores on the test and criterion scores. Eliminate the tests or 
test items which are not significantly correlated with the criterion 
scores.
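Steps 4 and 5 can be made concrete with a small computation. The following is a sketch under stated assumptions rather than a prescribed procedure: the scores are invented, the names `pearson_r` and `retain_valid_tests` are hypothetical, and the fixed cutoff `min_r` merely stands in for a proper judgment of statistical significance, which would also depend on the number of persons tested.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two score lists of equal length, two or more cases")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return 0.0  # no spread in one variable: treated here as no relationship
    return sxy / sqrt(sxx * syy)


def retain_valid_tests(test_scores, criterion, min_r=0.30):
    """Return {test name: validity coefficient} for tests reaching min_r.

    min_r is an assumed cutoff used only for illustration.
    """
    validities = {name: pearson_r(scores, criterion)
                  for name, scores in test_scores.items()}
    return {name: r for name, r in validities.items() if r >= min_r}


# Hypothetical data: criterion is units produced per hour by five workers.
CRITERION = [10, 12, 15, 18, 20]
TEST_SCORES = {
    "assembly": [3, 4, 6, 7, 9],
    "vocabulary": [5, 9, 4, 8, 6],
}

if __name__ == "__main__":
    for name, r in retain_valid_tests(TEST_SCORES, CRITERION).items():
        print(f"{name}: r = {r:.2f}")
```

With these invented figures the assembly test is retained and the vocabulary test is eliminated, which is the kind of decision Step 5 calls for.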

It is easily appreciated that if this procedure were to be used in 
guidance in the absence of any organization of occupations or of 
workers’ ability into relatively small groups, guidance would re- 
quire (1) the development of an aptitude test for each of the thou- 
sands of existing jobs, and (2) the administration of a large number 
of these tests to every individual who is to be guided into the oc- 
cupation for which he is best suited. It is the impracticality of 
vocational guidance based on such a cumbersome and unwieldy 
method which has motivated the movements toward the isolation 
of primary characteristics and the organization of occupations into 
families. 

The factor analysis and occupational grouping movements have 
not yet reached such a state of development that their results can
be used by classroom teachers in practical work with their own
pupils. But the need for vocational guidance will not wait upon
such developments; hence the teacher or guidance counselor is 
faced with the necessity of using as valuably as possible the materials 
already at hand. The time which must elapse before adequate meas- 
ures of all or most of an individual's occupationally related charac- 
teristics will be available in terms of the relatively small number of 
test scores is much too long to permit postponing some attempt to 
meet the vocational guidance needs of youth.


GENERAL FUNCTIONS OF AVAILABLE DEVICES


Fortunately, the techniques and equipment already at hand enable 
some valuable guidance assistance to be given. There are available a 
considerable number of special ability evaluation devices which 
enable the prediction of success in the various fields of economic 
endeavor. And in the practical school situation there are available 
various techniques for narrowing down to a relatively small number 
the choice of tests to be given to a particular individual. “Negative” 
guidance, or the elimination of broad areas of occupations as pos- 
sibilities for a given person, can be carried out with these techniques. 
The “narrowing-down” procedure involves the use of all the in- 
formation about a pupil in each of the various aspects with which 
this book is concerned. His scholastic achievement, physical aspects,
general mental ability, social and emotional adjustment, and socio-economic
environment and background all provide a detailed and
comprehensive picture of his make-up. From this picture it is possible
to ascertain those jobs in which he is most likely to be successful.
His attitudes and interests are of special importance in this
respect. We shall be especially concerned with them in a later
chapter. Here it is fitting that we note the possibilities of vocational
interest evaluation for the delimitation of the field in which special 
ability tests should be given. 
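"Negative" guidance amounts to striking out every occupational area whose demands exceed what is known of the pupil. The sketch below is purely illustrative: the areas, the traits, the amounts required (expressed in the A, B, C amounts of the worker characteristics form), and the rule that an unrecorded trait counts as C are all assumptions, and `eliminate_areas` is a hypothetical name.

```python
# Order of the amounts used on the worker characteristics form.
LEVEL = {"C": 0, "B": 1, "A": 2}

# Hypothetical minimum amounts demanded by three occupational areas.
AREA_REQUIREMENTS = {
    "engineering": {"mathematics": "B", "spatial relations": "B"},
    "actuarial work": {"mathematics": "A"},
    "watch repairing": {"finger dexterity": "A"},
}


def eliminate_areas(pupil, requirements):
    """Split areas into those still open and those ruled out for this pupil.

    A trait missing from the pupil's record is assumed to be at amount C.
    """
    open_areas, ruled_out = [], []
    for area, needs in sorted(requirements.items()):
        meets_all = all(LEVEL[pupil.get(trait, "C")] >= LEVEL[need]
                        for trait, need in needs.items())
        (open_areas if meets_all else ruled_out).append(area)
    return open_areas, ruled_out


if __name__ == "__main__":
    pupil = {"mathematics": "A", "spatial relations": "C", "finger dexterity": "B"}
    still_open, removed = eliminate_areas(pupil, AREA_REQUIREMENTS)
    print("open:", still_open)
    print("ruled out:", removed)
```

Special ability tests would then be chosen only for the areas that remain open, which is the sense in which the field is narrowed down.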

The role of the special ability test is thus supplementary; it serves
to round out and complete the picture of the pupil's vocational 
strengths and weaknesses. Instead of a test for each vocational ap- 
titude, only a few such tests are necessary in order to answer 
questions about a pupil which are not already answered by the other 
data concerning him. An illustration will perhaps clarify this point. 
Let us suppose that a pupil is undecided whether to enter an
engineering college or to major in mathematics in a liberal arts
college with a view toward becoming an actuarial worker. His 
scholastic achievement indicates that he is equipped to do college 
work and is sufficiently strong in mathematics for either occupa- 
tion. His physical status is adequate for either outdoor engineering 
work or indoor actuarial research. His general scholastic aptitude 
is of a high level and he rates equally high in both mathematical 
and verbal factors. His social and personal adjustment is of such a 
nature that he could manage men as an engineer or enjoy the 
solitary nature of actuarial research equally well. In the same way
his attitudes and interests do not permit of a discrimination between
the two fields, and his socio-economic environment and background,
other than providing sufficient financial resources for either kind of 
training, does not afford any clue as to which would be the better 
vocational choice for him. 

In such a hypothetical situation, probably also quite rare, a test 
of special ability might furnish the bit of evidence necessary to 
weigh the balance in favor of one vocation as against the other. 
In this case, it might be found that the pupil's ability to handle 
spatial relations is far below that of most engineers. For the counselor 
would realize, along with Bingham (2 : 172), that "ability to perceive 
the sizes, shapes, and relations of objects in space and to think 
quickly and clearly about these relations is another distinct asset for a 
student of engineering. He must be able to see how the parts of 
a mechanism fit together, and to infer what happens to one part 
when another part moves. Many engineers, although not all, are 
facile visualizers. But all, whether excelling in their powers of 
visual imagination or not, must somehow learn to read diagrams and 
prepare blueprints, to make and read topographical maps and
profiles, to translate two-dimensional sketches into three-dimensional
models, and vice versa. Aptitudes for thinking about shapes, sizes,
and spatial relations are particularly valuable in the study of drafting,
descriptive geometry, and mechanics."

In this case, this difference between the pupil’s ability to handle
spatial relations and his otherwise high degree of potentialities for
success in both engineering and actuarial work might be sufficient 
to enable his vocational guidance counselor to encourage him 
toward the latter and away from the former. The problem for
this pupil would have been narrowed down by all the other data
concerning him to the point where one special ability test, such 
as the Minnesota Spatial Relations Test, the Kent-Shakow Form 
Board, or the Minnesota Paper Form Board, might have been 
sufficient to provide decisive evidence. It is in situations such as 
this that the devices for evaluating special abilities to be discussed 
below find their greatest usefulness for the classroom teacher. 


AVAILABLE SPECIAL ABILITY TESTS


What are the major fields for which special ability tests have been 
devised? What specific tests are available in each of these fields? 
Let us now attempt to answer these practical questions, remem- 
bering that vocational guidance presents a problem in which a 
single individual must choose from among many vocations, while 
vocational selection requires choosing from among many individuals 
those who are best fitted for a given job. 

Tests of special abilities have been developed, and usually pub- 
lished, for the use of vocational guidance workers mainly in the 
following fields: 

1. Mechanical ability

2. Manual dexterity 

3. Clerical ability 

4. Music ability 

5. Art ability 

6. Professional abilities: medicine, law, engineering, nursing 

Mechanical Ability Tests—Mechanical ability tests may be classi- 
fied as either (1) mechanical assembly tests, (2) spatial relations 
tests, or (3) tests of mechanical information. Typical of the first is 
the Minnesota Mechanical Assembly Test, a set of three boxes 
containing simple mechanical objects, such as a bicycle bell, a 
monkey wrench, and a metal pencil. The subject is required to 
assemble these within given time limits and his product is scored 
with partial credit. The test is valuable for predicting the success of
junior high school boys in shop courses but is not very useful for
older persons. It may be criticized on the ground of the possible
large variation in scores resulting from crude and inadequate materials
used in the simple mechanical objects.

Performance tests of spatial relations may be illustrated by the
Minnesota Spatial Relations Test and the Kent-Shakow Form
Boards. The Minnesota test consists of four boards with 58 odd-shaped
cutouts which the pupil is instructed to put in their proper
places in the board as rapidly as possible. The score, amount of time
required, is intended to be an indicator of probable success in high
school shop courses and in such occupations as garage mechanic,
manual training teacher, and ornamental iron worker.


TABLE 10. Mechanical Ability Tests

                                               Grades     No. of  Time (min.)  Publisher No.
Name of Test                                   Designed   Forms   Required     (see list on
                                               for                to Give      pp. 561-562)

Minnesota Mechanical Assembly Test             7-10       1       60           22
Minnesota Spatial Relations Test               7-16       1       15-45        22
Kent-Shakow Form Board                         1-17       1       20-40        33
Minnesota Paper Form Board, Revised            7-10       1       20           26
Detroit Mechanical Aptitudes Examination       7-16       1       40           28
O'Rourke Mechanical Aptitude Test              7          3       70           27
Stenquist Mechanical Aptitude Tests, I and II  5          1       95           38


The Kent-Shakow Form Boards contain five holes or recesses into 
which a graded series of eight sets of blocks must be fitted. The
score, time required to fill the five recesses, is intended to be useful at
all ages above six in determining fitness for mechanical occupations.

A paper and pencil test of spatial relations is the Minnesota Paper 
Form Board, Revised, which consists of diagrams of disarranged 
parts of two-dimensional figures. The task is to select from five al- 
ternatives the diagram which indicates how the parts fit together. 
The score, number correct out of 64 items, may be interpreted in 
relation to the scores of engineering students, first-year vocational 
school pupils, and elementary school boys and girls of different 
grades and ages. 

Paper and pencil tests of mechanical information are illustrated by
the Detroit Mechanical Aptitudes Examination; the O'Rourke
Mechanical Aptitude Test: Junior Grade; and the Stenquist Mechanical
Aptitude Tests, I and II. The Detroit test consists of eight
subtests: tool recognition, motor speed, size discrimination, arithmetic
fundamentals, disarranged pictures, tool information, belt
and pulleys, and classification. Its high correlations with general
mental ability and the lack of data concerning its validity render it 
less useful than is to be desired. The O'Rourke Mechanical Apti- 
tude Test proceeds on the assumption that the amount of mechanical 
information possessed by an individual reflects interest in and apti- 
tude for mechanical activities. Pictorial and verbal material con- 
cerning the applicability of tools and mechanical processes in match- 
ing and multiple-choice form is presented. The Stenquist Me- 
chanical Aptitude Test requires the pairing of pictures of parts of 
common tools, contrivances, and machines. 

Manual Dexterity Tests—Manual dexterity tests measure the 
ability to work rapidly and skillfully with the fingers, hands, and 
arms. Steadiness and eye-hand coordination at various levels of
complexity are required by the different tests. The Minnesota Rate 
of Manipulation Test is intended to measure rapidity of movement 
in working at simple tasks involving the hand and fingers. Part One, 
Placing, requires placing 60 cylindrical blocks in 60 regularly ar- 
ranged holes in a board. The score is the total time required for 
four trials after one practice trial. Part Two, Turning, requires
the subject to pick up each block from its hole, turn it over, and
replace it with the other hand. After each row of fifteen blocks,
the direction and hand functions are reversed. Scoring is the same
as for Placing. This test is considered useful in predicting success as
a packer, wrapper, cartoner, or similar routine manipulative worker.

TABLE 11. Manual Dexterity Tests

                                            Grades      No. of  Time (min.)  Publisher No.
Name of Test                                Designed    Forms   Required     (see list on
                                            for                 to Give      pp. 561-562)

Minnesota Rate of Manipulation Test         10-adults   1       15-20        14
O'Connor Finger Dexterity Test              10-adults   1       10-20        23
O'Connor Tweezer Dexterity Test             10-adults   1       10-20        23
I.E.R. Assembly Tests for Girls, Abridged   7           1       20-40        9

The O'Connor Finger Dexterity Test requires picking up three 
pins at a time from a tray and inserting them in small holes in a 
metal plate. The score is the time taken to fill the 100 holes in the 
plate. The test is useful in relation to occupations involving rapid 
handling of small objects, such as assembling clocks and radio 
fixtures or operating keyboard office machines. The O'Connor 
Tweezer Dexterity Test uses the reverse side of the metal plate; 
here the holes are large enough for only one pin at a time. The 
pins are picked up from the tray one at a time with tweezers and 
inserted in the holes as rapidly as possible. The score, time required 
for the 100 holes to be filled, is related to success in occupations re- 
quiring hand steadiness and eye-hand coordination, such as labora- 
tory work, surgery, drafting, and watch repairing. 

The I.E.R. Assembly Test for Girls: Abridged Form presents
seven tasks, such as sewing a piece of tape on a strip of muslin and
paper cutting and trimming. The tasks are selected for their interest
to girls. Each task is scored by evaluating the product, with
partial credit. The test is intended to predict success at assembling
jobs in terms of ability to work with the hands.

Clerical Ability Tests.—The Minnesota Vocational Test for Clerical
Workers consists of two parts, number comparison and name
comparison. Numbers or names are presented in pairs separated by 
a line on which a check is to be marked if the members of the pair 
are exactly the same. 


Examples: 147 ✓ 147, 3896


The score, number correctly marked or left blank minus the number 
incorrect, is considered to be related to success in occupations re- 
quiring attention to clerical detail, such as bookkeeping, work as 
a bank teller, office machine operating, and stenography. 
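The scoring rule just stated, rights minus wrongs, can be sketched in a few lines. The pairs below are invented for illustration (only the pair 147, 147 comes from the example above), the name `score_comparison_test` is hypothetical, and the sketch assumes that only attempted items are submitted for scoring.

```python
def score_comparison_test(pairs, checked):
    """Score a name- or number-comparison test as rights minus wrongs.

    pairs   -- list of (left, right) strings as printed on the test
    checked -- list of booleans, True where the examinee placed a check
    An item is right when identical pairs are checked and differing
    pairs are left blank; every other response is wrong.
    """
    if len(pairs) != len(checked):
        raise ValueError("one response is needed for every pair")
    correct = incorrect = 0
    for (left, right), mark in zip(pairs, checked):
        is_same = left == right
        if mark == is_same:
            correct += 1
        else:
            incorrect += 1
    return correct - incorrect


# Hypothetical items and responses.
PAIRS = [
    ("147", "147"),
    ("5208", "5280"),
    ("John C. Lind", "John C. Lind"),
    ("Hale & Sons", "Hale & Son"),
]
MARKS = [True, False, True, True]  # the last check is an error

if __name__ == "__main__":
    print(score_comparison_test(PAIRS, MARKS))
```

Subtracting the wrong responses penalizes an examinee who checks every pair without comparing them.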

The O'Rourke Clerical Aptitude Test: Junior Grade consists of
nine parts: alphabetical filing; simple computations; classifying
individuals according to residence, occupation, age, and so forth;
comparing names and addresses; reading; spelling; analogies; general
information; and arithmetic problems. The test has been validated
against success as a typist or stenographer.

TABLE 12. Clerical Ability Tests

                                                                      No. of  Time (min.)  Publisher No.
Name of Test                                    Designed for          Forms   Required     (see list on
                                                                              to Give      pp. 561-562)

Minnesota Vocational Test for Clerical Workers  Ages 17 up (women);   1       20           26
                                                ages 19 up (men)
O'Rourke Clerical Aptitude Test                 Ages 17 and up        1       60           26
General Test for Stenographers and Typists      Ages 17 and up        1       95           26

The United States Civil Service Commission has developed a 
General Test for Stenographers and Typists which includes vocabulary,
English usage, spelling, reading comprehension, and “practical
judgment” items. The battery was found to differentiate fairly well
between good and poor stenographers and typists. Scores are in- 
terpretable in terms of those obtained by working stenographers and 
typists and of civil service eligibles. The test has been released for 
use in schools and industries. 

Music Ability Tests—The Seashore Measures of Musical Talent, 
Revised Edition, consist of two series of three double-faced phono- 
graph records measuring sense of pitch, sense of intensity, sense of 
time, tonal memory, sense of rhythm, and sense of timbre. These
subtests, based on a psychological analysis of musical talent, are
played to the subjects, who record their answers on special blanks.
For example, the first test, sense of pitch, presents a number of
paired sounds and requires the subject to indicate whether the
second sound is higher or lower in pitch than the first. The measures
may be used to help predict success in music as an avocation
or as a career. Series A, covering a wide range of difficulty, is used
for unselected groups. Series B is intended for sharp discrimination
among musically superior individuals.


TABLE 13. Musical Talent Tests

                                      Grades            No. of  Time (min.)  Publisher No.
Name of Test                          Designed for      Forms   Required     (see list on
                                                                to Give      pp. 561-562)

Seashore Measures of Musical Talent   5-16 and adults   2       60-80        29
Drake Musical Memory Test             Ages 8 and over   2       25           28
Kwalwasser-Dykema Tests               5-16 and adults   1       60-80        26

The Drake Musical Memory Test consists of 24 original two-bar
melodies to be played on a piano by the examiner or an assistant. Following
each of the standard melodies, two to seven variations differing
from the standard in key, time, or notes are presented. The score,
total number of errors in classifying the variations, is said to
correlate with music teachers’ estimates of “innate musical capacity.”

The Kwalwasser-Dykema Tests resemble the Seashore tests in
using a set of phonograph records. Ten elements of musical ability
are approached on the five double-faced records: tonal memory
(recognition), quality discrimination, intensity discrimination, tonal
memory (completion), time discrimination, rhythm discrimination,
pitch discrimination, melodic taste, pitch imagery, and rhythm
imagery.

Art Ability Tests—The Meier-Seashore Art Judgment Test re- 
quires the selection of the more artistic picture in each of a series of 
125 pairs. One of each pair is a reproduction of an artistic work 
of recognized merit, while the other has been altered in some way 
so as to lower its merit, make it less pleasing, less artistic, less satis- 
fying. The score, number of correct choices, may be interpreted with 
respect to norms for various grade levels from the seventh grade 
through senior high school. It furnishes a measure of one constituent 
of artistic talent, the “capacity for perceiving quality in aesthetic 
situations relatively apart from formal training.”

TABLE 14. Art Ability Tests

                                                       Grades            No. of  Time (min.)  Publisher No.
Name of Test                                           Designed for      Forms   Required     (see list on
                                                                                 to Give      pp. 561-562)

Meier-Seashore Art Judgment Test                       7-12              1       45-50        5
McAdory Art Test                                       1-17              1       90           6
Knauber Art Ability Test                               7-16 and adults   1       180          1
Lewerenz Test in Fundamental Abilities of Visual Art   3-11              1       105          7

The McAdory Art Test consists of 72 plates, each presenting four
variations of the same theme to be ranked in order of merit.
Six kinds of test material are included: furniture and utensils,
texture and clothing, architecture, shape and line arrangement, dark
and light masses, and color. These materials, although practical and
functional, are subject to becoming outmoded by fashion changes
which will change the standards upon which the test is to be scored.
The score, based on agreement with the ranking of 100 competent
judges, provides a functional measure of aesthetic judgment and
perhaps an indirect indication of creative art ability.
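Scoring by agreement with judges, as in the McAdory test, can be illustrated very simply. The sketch below is not the published scoring key: it assumes one possible rule, a point for every variation the pupil places in the same rank as the judges' consensus, and the plates, labels, and the name `agreement_score` are hypothetical.

```python
def agreement_score(pupil_rankings, judge_rankings):
    """Count rank placements on which the pupil matches the judges.

    Each ranking lists one plate's variations from best to worst.
    The one-point-per-matching-position rule is an assumed, simplified
    scheme used only for illustration.
    """
    if len(pupil_rankings) != len(judge_rankings):
        raise ValueError("a pupil ranking is needed for every plate")
    score = 0
    for pupil, judges in zip(pupil_rankings, judge_rankings):
        if sorted(pupil) != sorted(judges):
            raise ValueError("both rankings must order the same variations")
        score += sum(1 for p, j in zip(pupil, judges) if p == j)
    return score


# Hypothetical consensus of the judges and one pupil's responses for two plates.
JUDGES = [["B", "A", "D", "C"], ["C", "D", "A", "B"]]
PUPIL = [["B", "A", "C", "D"], ["C", "D", "A", "B"]]

if __name__ == "__main__":
    print(agreement_score(PUPIL, JUDGES))
```

Because the score depends wholly on the judges' rankings, any change in prevailing taste alters the key, which is the weakness noted above.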

The Knauber Art Ability Test requires drawing a design from 
memory, drawing figures within space limitations from memory, ar- 
ranging a specified composition within a given space, creating and 
completing designs from supplied elements, spotting errors in drawn 
composition, and finally drawing a composition “using your own 
symbol for labor.” The scoring is semi-subjective, but high reliability
coefficients are reported by the author. The test may be used to indicate
progress in art classes and creative ability rather than aesthetic
judgment. 

The Lewerenz Test in Fundamental Abilities of Visual Art consists
of nine tests: recognition of proportions, originality of line
drawing, observation of light and shade, knowledge of subject
matter vocabulary, visual memory of proportion, analysis of problems
in cylindrical perspective, analysis of problems in parallel perspective,
analysis of problems in angular perspective, and recognition
of color. Both judgment or taste and creative ability seem to be
tapped by this group of tests.

Professional Aptitude Tests.—Kandel (4) has summarized the 
attempts and results obtained in the fields of medicine, law, and
engineering. The Medical Aptitude Test of the Association of American
Medical Colleges is issued annually in a new form whose use is
restricted to medical colleges. The test is given every year at many 
universities to applicants for admission to medical schools. Although 
it is used mainly as a selection device, it can be of value in guidance 
if college students contemplating a medical career are advised by 
counselors to make their plans in accordance with scores obtained 
on it. Six subtests are included: comprehension and retention, visual
memory, memory for content, logical reasoning, scientific vocabulary, 
and understanding of printed material. 

The Stoddard-Ferson Law Aptitude Examination consists of five
parts: capacity for accurate recall, comprehension and reasoning by
analogy, comprehension and reasoning by analysis, skill in symbolic
logic, and comprehension of difficult reading. Although not generally
available for counselors in high schools and universities, it
should be called to the attention of any college student who contemplates
entering a law school. The test has been useful as a supplement
to other evidence, such as college grades, in predicting
success in law school work.

Engineering aptitude tests have taken the form mainly of mathematical
ability or achievement tests and spatial perception tests, together
with measures of general scholastic aptitude. Any of the
large number of available good tests in these three fields, together
with all other data concerning the pupil's scholastic achievement,
especially his vocational interests, provides the best indication possible
at present of probable success in an engineering curriculum.

Nursing aptitude has been approached through the Moss-Hunt
Aptitude Test for Nursing, which deals with scientific vocabulary,
general information, understanding of printed material, visual
memory, memory for content, comprehension and retention, and
ability to understand and follow directions. While the test material
has been selected for its relevance to nursing work, no previous
training in nursing is assumed. The scores on this test have been
found to correlate substantially with ability to handle the scholastic
material in the first year of training.

TABLE 15.—PROFESSIONAL ABILITY TESTS

Name of Test               Grades          No. of Forms        Time (min.)   Publisher No.
                           Designed for                        Required      (see list on
                                                                             pp. 561-562)

Medical Aptitude Test      15-17, adults   new form annually   110           10
Stoddard-Ferson Law        15-17, adults   1                   65            36
  Aptitude Examination
Moss-Hunt Aptitude Test    12-17, adults   1                   40            26
  for Nursing

General References on Special Ability Evaluation Devices.—In
addition to the Mental Measurements Yearbooks discussed in the
preceding chapter, several other volumes can provide valuable as-
sistance in the selection, administration, and interpretation of devices
for the evaluation of special abilities. Among these are the following:

Bingham, W. V., Aptitudes and Aptitude Testing, New York:
Harper & Brothers, 1937.

Paterson, D. G., Schneidler, G. G., and Williamson, E. G., Student
Guidance Techniques, New York: McGraw-Hill Book Company,
Inc., 1938.

In addition to discussions of aptitudes and guidance, of various
occupational fields, and of the practice of testing, Bingham's book
contains a valuable appendix that presents fairly detailed descriptions
of representative special ability tests and interest schedules of special
usefulness in vocational guidance. Chapter VIII of Student Guidance
Techniques discusses more fully than this chapter most of the tests
here mentioned. Each test is described, the group it is designed for is
discussed, and data concerning its reliability and validity at the
time of publication are presented. The norms for the test are also
discussed.


SUMMARY 


The group factor theory of mental ability organization is necessary 
to any program for differentiating from one another an individual's 
potentialities for success in various kinds of work. The present 
results of factor analyses are insufficient for categorizing the varied 
abilities required by the thousands of different American jobs. 
Occupational classifications according to abilities required are per- 
haps a more promising approach to the problem. Special ability tests 
should probably be used in schools more as supplements to other
available data concerning pupils than as routine tests; only in guid-
ance problems requiring additional special evidence need they be
used. The chapter concludes with discussions of representative avail- 
able tests which are predictive of ability to succeed with mechanical 
work, manual manipulation work, clerical work, musical training, 
art training, and professional training. 


QUESTIONS 


1. Some special ability tests require equipment too expensive for many 
schools to afford. What steps could be taken to make such tests 
available to many schools whenever needed and at greatly reduced 
cost? 

2. What major industries or occupational classifications in your com- 
munity are not approached by the special ability tests described in 
this chapter? How would you evaluate a pupil's ability to succeed in 
these occupations? 

3. What working relationships exist in your community between employ- 
ment services, public and private, and the schools whose pupils must 
eventually seek jobs there? How can these relationships be improved? 

4. In what colleges or post-secondary schools do many of your com- 
munity’s high school graduates eventually enroll? Is the pupil’s 
ability to succeed in these colleges evaluated as a basis for determin- 
ing the choice of college? Of curriculum in college? Does the second- 
ary school or the college make these evaluations? In what way? 
How would you improve these procedures? 

5. Examine your own experience or that of your associates for instances 
of failure in jobs or college studies which could have been foreseen 
through the evaluation of special abilities.


REFERENCES 


1. Bell, H. M., Matching Youth and Jobs, Washington: American
Council on Education, 1940.

2. Bingham, W. V., Aptitudes and Aptitude Testing, New York: Harper
& Brothers, 1937.

3. Hull, C. L., Aptitude Testing, Yonkers: World Book Company, 1928.

4. Kandel, I. L., Professional Aptitude Tests in Medicine, Law, and
Engineering, New York: Bureau of Publications, Teachers College,
Columbia University, 1940.

5. Kelley, T. L., “Talents and tasks, their conjunction in a democracy
for wholesome living and national defense,” Harvard Education
Papers, No. 1, Cambridge: Graduate School of Education, Harvard
University, 1940.

6. Shartle, C. L., “Guidance and occupational information,” Studies in
Higher Education XL, Bulletin of Purdue University, Proceedings of
the Sixth Annual Guidance Conference held at Purdue University,
November 29 and 30, 1940.

7. Stead, W. H., Shartle, C. L., and associates, Occupational Counseling
Techniques: Their Development and Application, New York: Ameri-
can Book Company, 1940.

8. Thurstone, L. L., The Vectors of Mind, Chicago: University of
Chicago Press, 1935.

9. Wolfle, D., Factor Analysis to 1940, Psychometric Monographs, Num-
ber 3, Chicago: University of Chicago Press, 1940.


CHAPTER XVI 


Adjustment 




In the field of adjustment evaluation the amount of work done
has been much greater than the usable results obtained. The purpose 
of such evaluation in the school may help us to select from the thou- 
sands of research articles and books in the field of "personality" 
measurement the relatively few which are of practical value to the 
classroom teacher. In Chapter V our answer to the question, “What 
is emotional and social adjustment?” provided us with a first step 
in the evaluation of this aspect of pupils. Having outlined a work- 
ing concept of adjustment and described some of its manifestations, 
both desirable and undesirable, we indicated the major importance 
of its evaluation in any thoroughgoing attempt to provide pupils 
with educational experiences and guidance. Pupils must be aided in 
avoiding the types of maladjustment discussed in Chapter V. Ade- 
quate evaluation of pupils will furnish data enabling the detec- 
tion of such maladjustments, or their prediction and prevention. On 
the positive side, it may assist pupils in improving the adequacy of 
their attempts to meet the demands of living effectively in the home, 
in the school, and in the community at large. 

In vocational guidance, data concerning modes of adjustment will 
be valuable in selecting the type of job for which the pupil is best 
fitted. The nature of his social-emotional adjustment, his ways of 
getting along with people, is obviously of fundamental importance
in vocational guidance. It is clear that the pupil who is eager to
work with other people and is capable of assuming leadership in
various social situations is fitted for one type of job, while the
pupil who works best alone or as a follower belongs in a different
kind of job or requires reeducation. The pupil with a long history
of behavior problems and an inner life full of worries and anxieties
will require guidance different from that given the pupil whose 
emotional and social adjustments have been more successful. 

In educational guidance the choice of a curriculum for the in- 
dividual pupil may frequently be determined in large part by the 
nature of his present adjustment. Maladjustment may be due to the 
unfitness of his course of study for his abilities and needs. The 
consequent anxieties or boredoms, aggressions or retreats, will not 
be allowed to escape the teacher's attention if the pupil's adjustment 
is being evaluated. All the symptoms of adjustment and maladjust- 
ment discussed in Chapter V may on occasion in one way or an- 
other be related to factors involved in the school life of the pupil. 
Adequate techniques of evaluating adjustment should not only lay 
bare the existence of different types of maladjustment but also assist 
in their diagnosis and remedial treatment. 

In personal guidance the evaluation of adjustment plays a similarly 
indispensable role. The vast areas of living encompassed by neither 
the school nor the job confront pupils with a multitude of prob- 
lems of a more or less difficult nature. In the process of seeking 
satisfaction for such basic needs as were discussed in Chapter V, 
the pupil’s successes and failures will construct a system of adjust- 
ment or maladjustment. The school can properly undertake its 
function of dealing with the whole pupil only if it is aware of his 
way of getting along with himself and his social environment. It is 
to provide such an awareness that the techniques discussed in this 
chapter are intended. 


DIFFICULTIES IN EVALUATING ADJUSTMENT


The teacher should not expect to have as much success in evaluat- 
ing adjustment as in evaluating such aspects as mental ability or 
physical status. Evaluations of adjustment have not yet reached the 
stage of development where their validity and reliability are as well 
established as evaluations of other aspects of pupils. Techniques are, 
of course, available to the psychiatrist and clinical psychologist which 
yield perhaps as much validity in adjustment evaluation as can be 
obtained by the physician’s diagnosis of a physical ailment. But 
these techniques demand far more training and time than the class-
room teacher can give to the problem.

Techniques of adjustment evaluation for the classroom teacher 
must be sufficiently administrable to fit into the total evaluation and 
instructional program of the school. Adjustment evaluation, how- 
ever important, must never become so time-consuming as to inter- 
fere radically with other parts of school work. Adjustment in the 
larger sense is of course the aim of all educational activities, but 
we are here concerned only with that somewhat narrower aspect of 
adjustment which was outlined in Chapter V and distinguished 
from other aspects of pupils. It is with the difficulties of the adjust- 
ment evaluation techniques available and practicable for the class- 
room teacher that the present discussion is concerned. 

Obtaining Frank Responses.—These difficulties are reflected in 
the title of this aspect of pupils. Adjustment is an emotional matter, 
something at which people cannot look in the light of pure reason 
alone. In contemplating their own adjustment, they are more likely 
to become biased, prejudiced, secretive, and deceitful of others and 
of self, than when contemplating their achievement in geometry, 
their physical health, or even their mental ability. Adjustment is a 
social matter that deals with relationships to other people, so that 
there is a stronger desire for social acceptability in terms of adjust-
ment than in other aspects of individuals. For these reasons, it is
more difficult to elicit responses which reveal adjustment.

Pupils conceal responses to questions concerning modes of be-
havior whose social acceptability or unacceptability is obvious or
suspected. And this is where the majority of the present techniques
are weakest. Most of the “personality tests” now available consist
of undisguised attempts to ask the pupil a list of highly personal
questions concerning his inner life, peculiarities, mistakes, and faults.
It is little wonder that Spencer (23: 179-194) had to conclude that
the personality questionnaire he administered to high school pupils
would have been invalidated if they had been required to sign their
names. Spencer's review of the literature in the chapter just cited
indicates quite conclusively the difficulties of obtaining honest re-
sponses to direct, undisguised questions concerning adjustments.

Obtaining Insightful Responses.—In the second place, valid an-
swers to many such questions are not given not only because pupils
are dishonest concerning such matters except in the hands of a clinical
psychologist who has established excellent rapport, but also because
they simply do not know the truth about themselves concerning
many questions of emotional and social adjustment. For example, 
pupils may be unaware of the answer to the question, "Do you 
daydream frequently?" which would be true for them. They may 
often daydream without being aware of it. Similarly, they may differ
widely in their interpretation of "frequently," some interpreting it
as once a week, others as once a day.

Similar lack of insight or understanding may invalidate responses
to such questions as “Do you frequently have spells of the blues?”
or “Does your ambition need occasional stimulation through contact
with successful people?” The objection to such questions on these
grounds may be answered by saying that it is not the truth of his
answer that determines whether a pupil is adjusted but rather the 
way he feels about the question. This defense is at least partially re- 
butted by the fact that answers to such questions are usually inter- 
preted in one way regardless of the significance of the question to 
the pupil or of his unique understanding or lack of understanding 
of it. 

Determining Dimensions of Adjustment.—A third set of diffi- 
culties in evaluating adjustment centers around the problem of the 
organization of personality and its analysis into traits or dimensions
by which it may be described and understood. This difficulty was
touched upon in Chapter V and illustrated by mention of the
Allport-Odbert psycho-lexical study. Constructing a test and giving
it a name like “neuroticism” or “introversion” is, of course, not
sufficient to set up that trait as a genuine dimension of personality.
In order for a trait to be considered a unitary aspect of individual
adjustment or “personality” it must satisfy certain logical and
statistical requirements. It must not correlate with other traits, be-
cause it is not pure, unitary, or efficiently descriptive if correlation
reveals that it has much in common with others.

Secondly, the trait must be consistent within itself so that various
measures of the same trait, such as items on a personality question-
naire, correlate much more highly with one another than they do
with measures of other supposed traits. In the third place, the
measure of the trait must be reliable; it must give the same result
upon successive measurements unless it can be shown that the in-
dividual himself has undergone a change with respect to that trait.
The situation here is comparable to that in the field of mental ability,
where the need for the isolation of unique, independent, special
mental abilities rather than a “higgledy-piggledy” conglomeration
of general mental ability has been pointed out.
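
The second requirement above, that measures of one trait agree with one another more closely than with measures of other supposed traits, can be illustrated with a small computation. The following is only a sketch under stated assumptions: the item scores are invented for illustration, and the helper names `pearson` and `mean_r` are hypothetical, not taken from any published inventory.

```python
from itertools import combinations, product
from math import sqrt


def pearson(x, y):
    """Pearson product-moment correlation of two equal-length score lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def mean_r(pairs):
    """Average correlation over an iterable of (item, item) score-list pairs."""
    rs = [pearson(a, b) for a, b in pairs]
    return sum(rs) / len(rs)


# Invented data: each inner list holds one item's scores for the same six pupils.
trait_a = [[1, 2, 3, 4, 5, 6], [2, 2, 3, 5, 5, 6], [1, 3, 3, 4, 6, 6]]
trait_b = [[3, 1, 4, 1, 5, 2], [3, 2, 4, 1, 6, 2], [2, 1, 5, 1, 5, 3]]

within_a = mean_r(combinations(trait_a, 2))   # items of trait A with one another
within_b = mean_r(combinations(trait_b, 2))   # items of trait B with one another
between = mean_r(product(trait_a, trait_b))   # items of A with items of B

print(round(within_a, 2), round(within_b, 2), round(between, 2))
```

If the two sets of items behaved as separate, internally consistent traits, both within-trait averages would be expected to exceed the between-trait average. The same `pearson` function applied to two successive administrations of one measure would give a rough test-retest check on the third requirement, reliability.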

In another sense, however, the need for unique traits is perhaps
not so great in the field of adjustment evaluation if the devices used
are considered not as precise “measuring instruments” but rather as
rough aids to understanding the individual. When used in this way,
devices for the evaluation of adjustment may assist in merely in-
dicating some individuals whose maladjustment is sufficiently serious
to warrant investigation by more refined techniques in the hands of
a clinical psychologist or psychiatrist. At any rate the desirability of
some categorization of the areas in which adjustment must take
place or of the modes in which adjustments can occur has been
increasingly recognized by personality test builders.

Thus the technique of factor analysis has been applied to per-
sonality traits and adjustment evaluation devices. Flanagan (5) ap-
plied it to the Bernreuter Personality Inventory, which consists of a
series of questions to which the pupil responds by circling yes, no,
or ?. From this analysis of the correlations among the responses to
125 questions, two independent and internally consistent traits
emerged: self-confidence and sociability. A similar attempt by the
Guilfords (7) has resulted in the Nebraska Personality Inventory,
which measures three traits designated social introversion, emo-
tionality, and masculinity in such a way that the correlations among
them are relatively small, and each trait relatively independent of
the others. Thus factor analysis has been used to help define per-
sonality tests. The factorial technique is not of itself adequate to
this task but rather requires psychological insight in the ultimate
determination of the meaning of the cluster of questions and an- 
swers that it statistically derives. 

At the stage of the analysis where the clusters of questions, tests, 
or other measurement units must be interpreted in psychologically 
significant fashion, the process becomes as subjective as any process 
of “armchair psychologizing.” Here the results of factor analysis 
must interact with theories of personality structure, of the nature of 
society, and of the human organism. The value of the results of the
factor analytical method of determining personality traits also de-
pends upon the validity of the measures of psychological behavior
which compose the original battery of tests analyzed. Inasmuch as 
the original measures which entered into the factor analyses of 
Flanagan and the Guilfords were made up of highly personal, un- 
disguised, direct questions, their tests retain the disadvantages of 
this approach which were discussed above. They are merely scorable 
in such a way as to yield relatively independent, uncorrelated sub- 
scores whose validity remains to be determined. 

Individualizing Interpretation.—A fourth difficulty with most ad-
justment evaluation techniques is that they provide for the inter-
pretation of a given response or bit of evidence in the same way for
all personalities. In doing so they disregard the principle that a
given fact about a person reveals his adjustment primarily by
reason of the way he feels about that fact. This principle follows
from the discussion of the nature of adjustment in Chapter V. There 
we saw that adjustment depends on the satisfaction of motives or 
the satisfactory solution of problems or overcoming of obstacles. 
The objectively same response from two individuals may mean 
maladjustment for the one and good adjustment for the other. 

Thus an answer to an item on an adjustment questionnaire should 
not be, although it usually is, scored in the same way for all in- 
dividuals. For example, the pupil’s response to the question, “Do 
you make new friends easily?” is frequently interpreted with re-
spect to his social adjustment without regard for his feeling about
the desirability of making new friends easily, his satisfaction or
dissatisfaction with his ability in this respect, his attitude toward
others’ opinions of his ability, or the relationship of this attitude to
other attitudes. The full meaning of the pupil's response to such a
question depends upon all such associated attitudes, interpretations, 
and individualized meanings. 

Similar difficulties arise in interpreting the answers to most of the 
other questions on “personality tests.” How does the pupil feel 
about his response? Does he consider it desirable or undesirable? 
Does it lower or raise his self-respect? Does he consider his response 
that of the average person? Of the people he admires? Of the ideal 
person? Is it reasonable to assume with respect to a given pupil
that he wants to be like the average person in a particular trait,
like persons he admires, or like his concept of the ideal person?
The significance for the pupil's adjustment or maladjustment of
any given bit of evidence concerning him, of any behavior item, or
of his answer to any specific question depends upon all such as-
sociated considerations.

Most of the adjustment evaluation devices to be discussed in this
chapter have failed to take account of this significant aspect of the
evidence concerning adjustment. Attempts to circumvent this ob-
jection have, however, been made by Sweet (27) and Spencer (23).
A description of their work may serve to clarify this difficulty of
adjustment evaluation. Sweet constructed a test of 50 items such as
“Having Other Folks Praise Me,” “Reciting in Class at School,”
and “Washing Dishes.” To each of these items the pupil was to in-
dicate on a five-step scale of like-dislike his rating on “How I feel,”
“How most boys feel,” and “How I think I ought to feel.” From
these sets of “self-ordinary-ideal” ratings, seven different scores
were obtained:
1. Self-criticism—Number of times self is different from ideal
2. Criticism of average boy—Number of times ordinary differs
from ideal
3. Feeling of difference—Number of times self differs from
ordinary
4. Superiority—Number of times self is rated nearer to ideal
than to ordinary
5. Inferiority—Number of times ordinary is rated nearer to
ideal than to self
6. Deviation from the accepted idea of the right—Number of
items marked ideal in the same way that 20 per cent or less
of some standard group have marked them
7. Social insight—Number of items marked ordinary in the
same way that 20 per cent or less of some standard group
have marked them
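
The seven scores listed above lend themselves to a simple tally. The sketch below is one possible reading of the list, not Sweet's own scoring procedure: the four items, the ratings, and the five-pupil "standard group" are invented; "nearer" is taken to mean a smaller absolute difference on the five-step scale; and the names `rare_count` and `sweet_scores` are hypothetical.

```python
from collections import Counter


def rare_count(ratings, group_ratings, cutoff=0.20):
    """Count items where the pupil's rating matches one chosen by <= cutoff of the group."""
    count = 0
    for item, rating in ratings.items():
        group = group_ratings[item]
        share = Counter(group)[rating] / len(group)
        if share <= cutoff:
            count += 1
    return count


def sweet_scores(self_r, ordinary_r, ideal_r, group_ideal, group_ordinary):
    """Tally the seven scores from self / ordinary / ideal ratings (1-5 scale)."""
    items = list(self_r)
    return {
        "self_criticism": sum(self_r[k] != ideal_r[k] for k in items),
        "criticism_of_average": sum(ordinary_r[k] != ideal_r[k] for k in items),
        "feeling_of_difference": sum(self_r[k] != ordinary_r[k] for k in items),
        # Assumed reading: self lies closer to ideal than it does to ordinary.
        "superiority": sum(
            abs(self_r[k] - ideal_r[k]) < abs(self_r[k] - ordinary_r[k]) for k in items
        ),
        # Assumed reading: ordinary lies closer to ideal than it does to self.
        "inferiority": sum(
            abs(ordinary_r[k] - ideal_r[k]) < abs(ordinary_r[k] - self_r[k]) for k in items
        ),
        "deviation_from_accepted_ideal": rare_count(ideal_r, group_ideal),
        "social_insight": rare_count(ordinary_r, group_ordinary),
    }


# Invented ratings for one pupil on four items.
self_r = {"praise": 3, "reciting": 2, "dishes": 1, "games": 4}
ordinary_r = {"praise": 5, "reciting": 5, "dishes": 2, "games": 4}
ideal_r = {"praise": 3, "reciting": 5, "dishes": 4, "games": 4}

# Invented "standard group" of five pupils: their ideal and ordinary ratings per item.
group_ideal = {"praise": [3, 3, 4, 3, 5], "reciting": [5, 5, 4, 5, 5],
               "dishes": [3, 3, 3, 4, 2], "games": [4, 5, 4, 4, 5]}
group_ordinary = {"praise": [4, 4, 5, 4, 4], "reciting": [3, 3, 2, 3, 4],
                  "dishes": [2, 2, 1, 2, 2], "games": [4, 4, 4, 3, 4]}

scores = sweet_scores(self_r, ordinary_r, ideal_r, group_ideal, group_ordinary)
print(scores)
```

The first five scores need only the pupil's own three columns of ratings, whereas the last two cannot be computed until the ratings of a standard group have been collected.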



validity against such criteria of personality and behavior problems 
as psychiatric case studies and social workers’ ratings was also quite 
promising. The technique does not, however, seem to overcome 
adequately the other difficulties of adjustment evaluation, such as 
the need for honesty and insight. The ratings are undisguised and 
direct to a high degree; they assume that pupils will not be in- 
fluenced in their responses by the close juxtaposition of the three 
columns and that all pupils will have the same criteria in their 
responses to “How I ought to feel.”

Spencer's technique (23) seeks to refine the approach to subjec-
tive feelings in terms of "fulcra of conflict." Each individual's be-
havior, experience, and characteristics as reported in his responses
to a questionnaire are interpreted not in terms of their face value 
but rather in relation to the following six "fulcra" or hinges upon 
which the significance of the individual's experience of conflict 
turns: 

1. The individual's own beliefs, ideals, or aspirations in regard 
to the variable

2. The individual's report of what he believes are his mother's 
beliefs, ideals, or aspirations in regard to the variable 

3. The individual's report of what he believes are his father’s 
beliefs, ideals, or aspirations in regard to the variable 

4. The individual’s report as to his mother’s own behavior or 
experience in regard to the variable 

5. The individual's report as to his father’s own behavior or 
experience in regard to the variable 

6. The individual’s report as to the behavior or experience of 
his closest associates in regard to the variable 

Thus the pupil fills out seven separate questionnaires. The first 
three deal with “beliefs, ideals, and aspirations,” the first being filled 
out by checking the one statement out of three with which he most 
nearly agrees, the second by checking the one statement with which 
he thinks his mother would most nearly agree, and the third as 
he thinks his father would do it. The next four questionnaires are 
descriptions in turn of his mother, father, associates, and himself in 
which he checks the one statement out of three which best describes
each. The first items of each of the two questionnaires are as fol-
lows.


[Sample first items of the two questionnaires; illegible in this copy.]


against clinicians’ ratings (19). The difficulty of obtaining such 
ratings is reflected in the fact that Rogers found it practicable to 
use only 52 children in his study of the test's validity as compared 
with the hundreds of cases usually desired for this purpose. He 
justifies these small numbers on the ground that they are preferable,
when accompanied by a satisfactory criterion, to a larger number for
whom no real criterion is available. 

In view of the difficulty of obtaining criteria for adjustment
evaluation techniques, many efforts to validate have succumbed to
the temptation to regard the technique as its own criterion. Al-
though far less satisfactory, this serves to insure that the device is
at least reliable and a consistent measure of some trait whose validity
is then inferred from the nature of the data obtained with the
device. The nature and extent of this validation difficulty will further
emerge in our consideration of specific evaluation devices and tech-
niques.


AVAILABLE ADJUSTMENT EVALUATION TECHNIQUES


Let us now consider the specific techniques of adjustment evalua- 
tion. The task here is to select from the many different approaches 
which psychologists have devised for the study of this aspect of 
human behavior those few which have practical significance for 
the classroom teacher. Many of them must be eliminated because 
of the demands they make on the user of the techniques and the 
interpreter of the data they yield. Thus some of these techniques 
may properly be used only by those with specialized training in 
clinical psychology, while others require refined apparatus only 
rarely available in schools and usable only by experts. The first of 
these classes may be illustrated by the psychiatric interview tech- 
nique, and the second by the various physiological measures of 
emotions, such as the psychogalvanometer. The approaches to "per- 
sonality" which have been employed by psychologists, psychiatrists, 
sociologists, and educationists may be classified as follows: 

1. Self-inventories
2. Rating devices
3. Performance tests
4. Tests of knowledge and judgment
5. Observational records
6. Organized behavior descriptions
7. External physical signs
8. Free association
9. Laboratory techniques involving apparatus
10. Psychiatric interviewing, including psychoanalysis
11. The autobiography and life history
12. Projective techniques

Of the above techniques, those with the highest practical value to
the classroom teacher at present are (1) self-inventories, (2) rating
devices, (3) tests of knowledge, judgment, and conduct, (4) records
of observations, and (5) behavior descriptions. In some cases, the
imposed life situation may also be useful. Let us now examine each
of these approaches in turn.*


SELF-INVENTORIES 


Any collection of questions or statements designed to yield data 
on a pupil’s adjustment through his own responses to them may be 
considered a self-inventory or adjustment questionnaire. The ques- 
tions or statements are usually arranged in one of the constant- 
alternative forms, requiring the subject to choose from among the 
alternatives presented the one which best indicates the truth about 
himself. The alternatives are usually Yes and No, although some- 
times a ? is also included. In some cases a continuum is presented 
with Yes and No at the extremes so that the pupil can place a mark
somewhere between them. In other cases the continuum is expressed
with changing alternatives, specific phrases representing the different
levels of response to each question or statement. Thus instead of
asking the pupil, “Are you strong? Yes or No,” the item is presented
as “How strong are you?—Very weak, not very strong, strong, the
strongest in my class.” And the question “Do you have any good
friends?” is supplied with the alternatives “None at all, one or two, 
a few good friends, many friends, hundreds of them.” 
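
The "changing alternatives" form of item described above can be represented as an ordered list of phrases, with a response recorded as a position on the continuum. The following is a minimal sketch using the two example items from the text; the class name `Item` and the method `level` are hypothetical, and the numeric level is only a position among the alternatives, not a validated score of adjustment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """One self-inventory item with its alternatives ordered along a continuum."""
    question: str
    alternatives: tuple  # ordered from the lowest to the highest level of response

    def level(self, answer):
        """Return the 1-based position of the chosen alternative on the continuum."""
        return self.alternatives.index(answer) + 1


strength = Item(
    "How strong are you?",
    ("Very weak", "not very strong", "strong", "the strongest in my class"),
)
friends = Item(
    "Do you have any good friends?",
    ("None at all", "one or two", "a few good friends", "many friends",
     "hundreds of them"),
)

print(strength.level("strong"), friends.level("one or two"))
```

How a given level should be interpreted is a separate matter, left here to whatever key the inventory supplies; an extreme alternative such as "hundreds of them" need not indicate better adjustment than a moderate one.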

It can readily be understood that self-inventories are essentially 
a system of self-ratings whereby the pupil records his opinion of 

*Information concerning the other techniques may be obtained from P. M.
Symonds, Diagnosing Personality and Conduct, and from P. M. Symonds and
Elizabeth A. Samuel, “Projective methods in the study of personality,” Review of
Educational Research, 11: 80-93 (1941).

Most of the questions are subjective in the essential sense that only
the individual himself can render an opinion concerning them or
answer them, since they deal with his own inner experience, ob-
servable by no one but himself. This subjective nature is reflected
in the instructions usually given to “work rapidly” rather than
spend much time in reasoning and introspecting to arrive at the
true answer to the question. The intent of such directions is to
enable the question to get at a fundamental predisposition or mental
set based upon the accumulation of past experiences. Elaborate
reasoning, it is assumed, will do more to confuse and conceal such
basic inner states than to reveal them.

Necessity of Rapport.—Most of the self-inventories at present 
available employ direct, undisguised questions whose purpose can 
be readily detected by all but the most unsophisticated pupil. For 
this reason they are subject in a high degree to the difficulty of 
obtaining frank and insightful responses which was mentioned 
above. With these direct questions special care must be taken in 
administering the self-inventories to establish such a relationship 
between pupil and teacher that the pupil will be willing to reveal 
extremely personal information about himself. He must be made 
to feel that honesty, sincerity, frankness, and insight in answering 
will all work to his benefit. Assurance must usually be given that 
the responses will be held strictly confidential; that is, no one but 
a teacher or counselor whom the pupil trusts and respects should 
be permitted to know how he has answered the questions. When-
ever pupils come to suspect, on the basis of rumors or actual
evidence, that their trust has been violated, the future usefulness of
all such direct self-inventories is greatly impaired.

It is only necessary for the teacher to consider his own desires
with respect to the inviolability of his answers to such questions to
realize the fundamental importance of giving pupils full and honest
assurance that their responses will be kept strictly confidential. Even
in the relatively rare cases where a pupil seems to be unconcerned
with the number or kind of people who know his personal inner
life, the same confidential relationship should be maintained both
to preserve the usefulness of inventories with other pupils and to
guard against sudden reversals or changes in the sensitivity of the
particular pupil to publicity. This question of rapport will be further
treated in Chapter XX.

Disguised Inventories.—Of the relatively few attempts to construct 
disguised, indirect self-inventories of adjustment, the majority em- 
ploy questions concerning the pupil's interests. Long lists of ac- 
tivities, occupations, games, and books are presented and the pupil 
is required to indicate whether he likes, dislikes, or is indifferent to 
each of the items. The different patterns of preferences which are 
thus revealed are then compared with the patterns obtained from 
various groups of people upon whom the interest questionnaire was 
validated. This is the technique used by Symonds (28 : 142) in his 
Studiousness Questionnaire, which measures the tendency toward 
studiousness while pretending to discover children’s interests. 

Another attempt at the indirect evaluation of adjustment is 
Sheviakov and Friedberg's Interest Questionnaire (21). Their three
inventories containing 600 statements of activities commonly carried 
on by young people seek to approach, in an indirect manner not 
likely to be discovered by pupils, the system of basic aims and 
desires, the personality structure, the emotional tendencies, and the 
underlying drives and goals of the individual. The hundreds of 
statements are classified, during scoring rather than in the test itself, 
into such categories as “acceptance of own impulses,” “severity with 
oneself,” “relationship with family,” “identification with others,” 
“non-identification with others,” “relationship with the same sex.” 
In terms of the pupil's likes, indifferences, and dislikes in each 
category the pattern of his responses and its relationship to the 
dominant trend of the group are interpreted. Apart from its indirectness,
the authors claim for their technique the advantages of
flexibility and a dynamic approach to the description of the individual
pupil.

Available Self-inventories—Let us now turn to the description of 
some of the more widely used self-inventories available for the 
practical purposes of the classroom teacher. In discussing each of
these inventories mention will be made of the formulation of traits
at which the inventory is aimed, the method by which the items 
aimed at each trait were selected, the method of administering and 
scoring the inventory, and the available evidences concerning its 
validity and reliability. The inventories will be considered in the 
order of the age or grade levels of the pupils for whom they are in- 
tended. Data concerning grade levels, time required, authors, and 
publishers are listed in Table 16. As a result of this survey of avail- 
able devices the reader should obtain a fairly comprehensive notion 
of the techniques which have been used in constructing self-in- 
ventories and of their possibilities and limitations in the evaluation 
of pupil adjustment.

The Aspects of Personality test, for the elementary grades, is aimed
at three "aspects," or traits, of personality: ascendance-submission, 
extroversion-introversion, and emotionality. For each of the three 
sections there are 35 statements after each of which the pupil in- 
dicates whether he feels the “same” or “different.” Typical statements 
are the following: “I have a lot of nerve”; “I like to read before the 
class”; “I feel tired most of the time.” The statements were selected 
(1) from other inventories designed specifically for these three as- 
pects of personality and for different age levels; (2) by a subjective 
analysis of the traits and the items on the basis of the authors’ judg- 
ment; and (3) upon the statistical criterion that an item should cor- 
relate more highly with the total score on the trait in which it was in- 
cluded than with the other two traits. The latter served to insure 
lower intercorrelations among the three sections of the inventory 
than could be obtained merely through subjective judgment of what 
each item measured. The lower intercorrelations, in turn, improved 
the inventory’s efficiency by making each section approach a more 
distinct and separable aspect of personality.

The wording of the statements is fitted to the grade levels for 
which they are intended, especially in so far as many of the items 
were phrased by children in pre-tests. The classroom teacher may 
administer the inventory as a group test, the sample items being 
read and explained by her. The test blanks are easily scored by means 
of a stencil in terms of the number of “same” or “different” responses 
chosen by the pupil. The scores for boys and girls may be interpreted 
with respect to percentile norms furnished for each of the grades
from four through nine. The reliability of the three scores by both
the split-half method and the retest method centered around .75
for the group of pupils sampled. This relatively low reliability in- 
dicates that any tendencies revealed by the test should be interpreted 
with great caution, especially in the absence of confirming evidence. 

The validity (16) of this instrument in terms of correlation with
ratings has not been found high. The lowness of the validity coefficient
is, of course, largely due to the unreliability of the criterion
ratings as well as to inadequacies of the inventory itself. Contrary
to the test manual, the inventory must not be used and interpreted 
in the straightforward fashion in which group tests of, say, arith- 
metic achievement might be employed. Apart from the total scores 
on the whole inventory and its subsections, inspection of the responses 
to individual items may prove the most revealing and valuable use 
of the test. 
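
The effect of an unreliable criterion on a validity coefficient can be illustrated with the standard correction for attenuation, which the text does not itself name. In this sketch the observed validity of .30 and the rating reliability of .48 are hypothetical figures chosen for the arithmetic; only the inventory reliability of about .75 comes from the discussion above.

```python
from math import sqrt


def corrected_for_attenuation(r_xy, r_xx, r_yy):
    """Estimate the correlation between inventory and criterion if both
    were perfectly reliable.
    r_xy: observed validity coefficient
    r_xx: reliability of the inventory
    r_yy: reliability of the criterion ratings"""
    return r_xy / sqrt(r_xx * r_yy)


# Hypothetical illustration: .30 observed, .75 and .48 reliabilities.
estimate = corrected_for_attenuation(0.30, 0.75, 0.48)
```

With these assumed figures the estimate is about .50, which shows how much of a low observed coefficient may be traceable to unreliability of the ratings rather than to the inventory.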

The Brown Personality Inventory for Children is another self- 
inventory applicable in grades 4-9 or to children aged 9-14. The 
80 items, consisting of questions about behavior and feelings with 
Yes or No alternatives, are aimed at the following areas or aspects
of adjustment: home, school, physical symptoms, insecurity, and 
irritability. The items for each of these were derived from the litera- 
ture on the neurotic child and validated in terms of internal con- 
sistency with the total score on the test. No attempt was made to 
validate the grouping of the items into the various categories except 
through subjective judgment. 

The inventory is practically self-administering and scoring is 
similarly extremely simplified. The author reports a reliability co- 
efficient of .896. Scores may be interpreted with respect to the decile 
norms based on the total scores of 2748 unselected children of ages 
9-14 combined. The questions are not phrased so as to overcome 
the difficulties of securing frank and insightful responses, nor are the 
other difficulties of adjustment evaluation discussed in the first part 
of this chapter surmounted by this inventory. 

The Maller Case Inventory is aimed at four aspects of adjustment 
in terms of emotional response patterns, adjustment, honesty, and 
ethical judgment and integration. The first of these is approached 
through a controlled association test which requires the pupil to 
choose between two possible word associations with 50 stimulus
words. One of each pair of associations is considered rational and
the other irrational. The greater the number of irrational choices or 
associations, the greater the pupil's score for tendency toward emo- 
tionalized response patterns. The adjustment test consists of 50 self- 
inventory items. The honesty test consists of 15 self-scoring items 
entitled "What do you know about sport and hobbies?" The pupil's 


9 questions presenting two dual choice discrimination alternatives 
from which two scores are obtainable: E, the number of “correct” 
ethical discriminations, and I, the pupil's own code of behavior in 
relation to disagreement with the ethical pattern. 

The Case Inventory obviously presents a wider variety of evaluative
techniques than the usual straightforward self-inventory. The
items have been selected in terms of their power to discriminate be- 
tween pupils in New York probationary schools and an equated 
group of normal children. Out of a large number of items, only 
those which did thus discriminate between these contrasted groups 
were retained in the final form of the four subtests. Only the 
validity of the honesty, or trustworthiness, test was assumed to be 
but even this assumption may be questioned in the 
light of the evidence on the specificity of such traits to be discussed 
below. Apart from this general method of insuring validity by
discriminating between normals and delinquents, the author gives 
little information concerning the validity of the test for evaluating 
adjustment within a relatively normal sample. Similarly, no attempt 
was made to insure that the test in discriminating between these 
two groups was not also discriminating between two different levels 
of intelligence, social status, or school achievement. 

The administration of the Case Inventory is simple and straightforward,
and its scoring is relatively uncomplicated except for the
necessity of doubling the honesty and ethical judgment scores. The 


respectively 94, 92, 4, 95, and 59. The meaning 
of these reliability coefficients is decreased, however, by the fact that 


earlier in this chapter. The purposes of the first and third subtests
are well concealed and therefore independent of the pupil's frankness
and insight. Much additional evidence is necessary concerning
the validity of the test with such groups 
countered by classroom teachers. It
somewhat further the definition of by 
inventory in the second part, and to ascertain whether the 
scoring of the first, second, and fourth parts 
same way for all pupils regardless of the unique organization of 
their individual personalities. Is the same response
or ethical for all pupils? 

The Loofbourow-Keys Personal Index, intended to discover boys 
in Grades 7 to 9 who are or may become serious behavior problem 
cases, also has four parts. Part I measures deception by
"vocabulary" test containing fictitious words among real words that 
tempts the pupil to give an extravagant picture of his 
He must, of course, not suspect the nature of this test if it is to be 
successful. Part II is a measure of attitudes toward school, minor 
delinquencies, the law, and society by means of a key word or phrase 
which is followed by one socially desirable and three undesirable 
alternatives. The third part, entitled “The Virtues,” is a test of

coefficients with behavior ratings as the criterion ranged from .57 to 
.66, while reliability ranged from .84 to .95. Administration and 
scoring are simple, straightforward, and objective. 

The test is perhaps inadequate in that it aims at areas of adjust- 
ment, namely, problem or disciplinary cases, which teachers and 
school authorities can readily detect merely through alertness and 
observation. Only the fourth part, the adjustment questionnaire, 
would seem to reveal more hidden factors of maladjustment. But for 
the ready identification of symptoms and problems in a group of 
pupils not already well known to the teacher, the test appears to 
have considerable validity. 

The Rogers Test of Personality Adjustment is aimed at the follow- 
ing aspects of personality: feelings of personal inferiority, social ad- 
justment, family adjustment, and daydreaming. The personal in- 
feriority score provides a rough indication of the extent of feelings of 
physical or mental inadequacy on the part of the pupil. The social 
maladjustment score indicates the extent to which he is unhappy in 
his group contacts, poor at making friends, lacking in the social 
skills. The family maladjustment score indicates the degree of his 
conflict and maladjustment in his relations with his parents or 
siblings, such as jealousies, antagonisms, feelings of being rejected, 
overdependence. The daydreaming score indicates the extent to
which he indulges in fantasies and unrealistic thinking as a means 
of escaping from his problems. As was emphasized in Chapter IV, 
the daydreaming pupil, while not a behavior problem or a disrupter 
of school discipline, may be in a more serious state of mental un- 
health than the classroom rowdy. 

The items of the test were derived largely from the interviewing 
techniques employed by clinical psychologists and psychiatrists in 
working with children. The grouping into four subscores was made
“more or less naturally” rather than in accordance with any systematic
theory of personality structure. An explicit attempt was made
to secure indirectness in arranging and wording the test material 
in order to increase the probability of frankness on the part of the 
pupil. Also the detection of pupils’ attempts to mislead the examiner 
was facilitated by providing opportunities for overstatement, with 
the expectation that falsified attitudes would frequently also be 
exaggerated and detectable. 


There are six subtests which in varying degrees determine the 
four subscores. Test One requires the pupil to indicate his contact 
with reality or feelings of inferiority by ranking the most preferred 
three out of a list of 16 presented vocations, such as housewife, 
teacher, movie star, policeman, doctor, and engineer. Less valuable
than the other five parts, this first test was retained as an interesting, 
neutral, and non-emotional beginning for the test. Test Two requires 
a similar ranking of the three strongest wishes in a list of 15, such 
as “to be stronger than I am now,” “to have the boys and girls like 
me better,” “to get along better with my father and mother.” Test 
Three requires the pupil to write the names of the three people 
whom he would most prefer to take with him if he were going to 
live on a desert island. 

In Test Four he rates himself and then his ideal along a 10-point 
yes-no continuum with respect to various traits and qualities. A 
sentence describing a fictitious boy or girl in relation to strength, 
interest, appearance, social acceptability, or relationships to parents 
is given, and the pupil then answers the questions “Am I just like 
him?” and “Do I wish to be just like him?” It is clear that Rogers 
is here attempting to approach various areas of adjustment in 
terms of their significance to the individual pupil rather than in 
an arbitrary fashion; his attempt may be likened to Spencer's in 
using “fulcra of conflict.” The fifth test presents disguised questions 
concerning the areas of adjustment, with from three to six alterna- 
tives for each question from which the pupil may choose his answer 
at various levels of intensity. Test Six requires him to list his entire 
family—parents and siblings and himself—in order of chronological 
age, and then to rank them, as well as his best boy friend and girl 
friend, in order of preference or “love” felt toward each one. 

Administration of the test requires the teacher's close supervision 
and help. The manual gives detailed directions which, while almost 
standardizing the examiner’s every word, are easily followed. 
Scoring the test, although a completely definite and objective pro- 
cedure, is not nearly as simple and purely clerical a task as is usual 
with group tests. Each test must be gone through four times in 
obtaining the four diagnostic scores. Each of the six parts contributes
to each of the four scores in varying amounts and with
different items.

The author of the test is careful to de-emphasize the value and 
importance of the numerical scores. He considers them merely a 
summary which, while of preliminary survey value, provides less 
insight into the pupil than a study of the latter’s specific responses. 
For this reason the norms are accompanied by directions for inspect- 
ing the specific responses and hints for detecting those that are sig- 
nificant in various ways. This emphasis on the value of direct inspec- 
tion and interpretation of responses to self-inventories might well 
be applied to those of other authors besides Rogers. Rogers, however, 
is almost unique in providing detailed directions for such a clinical 
inspection of responses. As he states, the amount of evidence gained 
by this procedure will depend to a considerable degree on the 
examiner's familiarity with child psychology and behavior. 

In the light of Rogers’ de-emphasis on numerical scores the 
statistical evidences of the test's validity and reliability also become 
less significant. Validity coefficients of correlation between scores on 
the test and the ratings of 52 “problem” children and 84 “normal” 
children by psychiatrists and clinical psychologists were, for the 
personal inferiority score, .39; for social maladjustment, .43; for
family maladjustment, .38; and for daydreaming, .48. The reliability
of the total and part scores ranged around .70. The Rogers Test of
Personality Adjustment is unique for its indirect approach to the
pupil's “inner life,” its many evidences of clinical sense, its
interpretation of evidence throughout in a way that makes allowance
for the uniqueness of each personality, and its de-emphasis on
numerical summary scores while directing attention to the sig- 
nificance of individual responses. In pointing toward less mechanical 
scoring and interpretation, the test requires greater clinical sophistica- 
tion and psychological insight on the part of the examiner. The class- 
room teacher who uses this test to its fullest advantage will have to 
draw upon greater psychological resources than most other self-inventories
require. At the same time the numerical scores permit
as much speed and efficiency in the evaluation of many pupils at a 
time as other tests which do not direct attention to specific responses. 

The Bernreuter Personality Inventory is intended to yield measures
of the following six aspects of adjustment: neurotic tendencies,
self-sufficiency, introversion-extroversion, dominance-submission,
confidence, and sociability. The items for each of these traits were
selected from four earlier tests aimed specifically at them: the
Thurstone Personality Schedule, the Bernreuter Self-Sufficiency 
Scale, the Laird C2 Introversion Test, and the Allport Ascendance- 
Submission Reaction Study. The correlations between the scores for 
these traits obtained with the Bernreuter Personality Inventory 
and those obtained with the individual tests themselves were all 
sufficiently high to establish the four Bernreuter scores as identical 
with those yielded by the four original tests. In the years since 
Bernreuter's first publication of his inventory numerous studies have 
been made with it; Super (26) has reviewed more than 100 such 
studies. Among their findings is the fact that the neurotic tendency
and introversion-extroversion scores correlate so highly as to be 
identical. It is with the Bernreuter Personality Inventory that 
Flanagan (5) made one of the first factor analyses of personality 
traits. He derived two scoring keys, for “self-confidence” and 
"sociability," which were sufficient to summarize all the data obtain- 
able from the items of the inventory. In other words, all the in- 
formation contained in Bernreuter's initial four scales may be more 
efficiently summarized in Flanagan's two for self-confidence and 
sociability.

The administration of the inventory is similar to that of all group
tests with the exception of the necessity for securing greater rapport 
between examiner and students in order to increase the probability 
of honest and frank responses. Scoring is somewhat complicated 
since the response to each item is given a different weighting for 
each of the various scores obtained. Bennett (1), however, has found
that scoring each item zero, one, or two instead of the regular minus
seven to plus seven yielded results that correlated almost perfectly
with regular scores for the two Flanagan scales. 

The reliability of the various parts has been found to center 
around .85, which, as Flanagan (6) has pointed out, is sufficient to 
insure that about 70 per cent of the scores would rate students cor- 
rectly on a five-step scale, and 100 per cent would do so with an 
error of but one step on such a scale. 

Concerning the inventory's validity the evidence is somewhat conflicting
but tends toward discouragement. Stagner's report (24)
of substantial agreement between inventory scores and the impressions
obtained from interviews with more than 100 college freshmen
is perhaps invalidated by the fact that the interviewers had a
knowledge of the inventory scores. St. Clair and Seegers (20) found 
the self-confidence score to be fairly valid, but the sociability score 
was less consistent and refined than the neurotic tendency, self- 
sufficiency, and dominance-submission scores as a measure of socia- 
bility. 

Hathaway (9) found the Bernreuter Inventory to be of considerable
value as an aid in the diagnosis of psychopathic inferiors, persons
who lack emotional response, ability to profit from emotional
experience, and respect for social custom. Such persons come to
clinics not because of personal complaints but because their lack of
emotional response brings them into conflict with the law or their 
families. They lack the ability to be humiliated or shamed which 
for most people keeps in check their asocial tendencies. For all nine 
such cases diagnosed over a period of three years, the scores for
neurotic tendency on the Bernreuter Inventory fell either in the
10 per cent most adjusted or entirely off the scale on the adjusted end.

Speer (22) and Jarvie and Johns (11) have reported two studies 
in situations comparable to those in which classroom teachers might 
use the inventory. Speer found the inventory to show no significant 
differences between problem and non-problem groups of pupils and 
to be of little value in the prediction of problem behavior. Jarvie 
and Johns, of the Rochester Athenaeum and Mechanics Institute, 
similarly correlated Bernreuter scores with ratings given by counsel- 
ors in that institution, and found the relationships very low. It is 
difficult to reconcile these findings with those of Stogdill and Thomas 
(25), who found the inventory distinctly valuable in distinguishing 
between well-adjusted and maladjusted students in the work of a 
student psychological consultation service. The superior rapport 
probably prevailing in such a service, as compared with that in the 
average classroom, is perhaps sufficient to explain the difference. 

This conflicting evidence on the inventory’s validity should serve 
to emphasize the importance of rapport with the student as a major 
determiner of the validity of the scores obtained. This is true not only 
of the Bernreuter Inventory, but also of all other inventories of the
same general type. It is only because of the greater amount of re- 
search done with this inventory that this emphasis is made in con- 
nection with it. The whole point is well illustrated by a conversation 
which Bernreuter reports (2):

One boy came into our clinic for advice, and we started to give him 
a personality inventory. He said, “I took one of these last week.” 

"Where?" 

“A professor in my psychology class gave me one.” I said, “In that
case you will not need to take this. If it is all right with you I will ask
him to let us have that one and it will save you the trouble of taking
the test again, and it will save us the trouble of rescoring it.”

The boy hesitated and said, “I think you had better give it to me. I
don’t like that professor. I don't want him to know what I’m like, and 
I didn’t tell him the truth on that test. However, if you will give me one 
now, I will tell you the truth.” 


Not only is good rapport necessary if the full usefulness of self- 
inventories is to be realized, but also a willingness on the part of 
teachers and counselors to become intimately familiar with the in- 
dividual items of these inventories. As was emphasized above in 
connection with the Rogers Test of Personality Adjustment, such 
inventories are frequently more useful as check lists of symptoms 
than in terms of their total or part numerical scores. Whenever the 
seriousness or interest of an individual student's case so indicates, 
the responses to individual questions should be studied for what 
they can reveal concerning his adjustment. 

The Bell Adjustment Inventory is designed to provide measures 
of home adjustment, health adjustment, social adjustment, and emo- 
tional adjustment. It attempts to tap these four areas by 140 yes-no-? 
questions. The questions for all four fields are presented together in 
random order and the scoring key enables the selection of the 
responses relevant to each area. 

Based on 258 college freshmen and juniors, the reliability of the 
total adjustment score is reported to be .93, while the reliabilities 
of the four subscores range from .80 to .89. Validation of the in- 
ventory was found to be satisfactory for individual items in terms of 
their power to discriminate between individuals in the upper and 
lower 15 per cent of total scores. 
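
The item-validation step just described, comparing the upper and lower 15 per cent on total score, can be sketched as follows. This is a minimal illustration under stated assumptions and not Bell's own computation: the function name is hypothetical, each item is taken as scored 1 (keyed response) or 0, and the index is simply the difference between the two groups' proportions.

```python
def discrimination_index(item_scores, total_scores, fraction=0.15):
    """Difference between the proportions of the upper and lower groups
    (by total score) giving the keyed response to one item.
    Tied total scores keep their original order when ranked."""
    n = len(total_scores)
    k = max(1, round(n * fraction))          # size of each extreme group
    order = sorted(range(n), key=lambda i: total_scores[i])
    low, high = order[:k], order[-k:]
    p_high = sum(item_scores[i] for i in high) / k
    p_low = sum(item_scores[i] for i in low) / k
    return p_high - p_low
```

An index near zero would mark an item that fails to separate the extreme groups; how large a difference counts as satisfactory is a separate judgment that this sketch does not settle.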

Validation of the separate subscores was also found satisfactory in 
terms of power to discriminate between groups of individuals 
selected by counselors as well adjusted and poorly adjusted. The 
counselors who selected these extreme groups all held professional 
positions in widely separated localities and had had five or more 
years of experience in counseling maladjusted individuals. Independent
studies such as that by Peterson (15) have also found significant
differences between the scores of individuals determined by
ratings and interviews to be well adjusted and poorly adjusted in the 
specific areas of the inventory. Within the limitations of all direct, 
undisguised, arbitrarily scored adjustment inventories, the Bell Ad- 
justment Inventory constitutes a compact, simple, and useful device. 

The Nebraska Personality Inventory represents another of the 
few self-inventories developed with factor analysis techniques. The 
three factors which the test attempts to measure are social introversion,
emotionality, and masculinity. The 101 questions have
been built around the 36 original questions which entered into the
factor analysis. The reliabilities of the social introversion and emotionality
scores based on 665 men and women range around .90, and
the reliability of the masculinity score ranges around .65. Apart from 
its derivation by factor analysis, the inventory differs little from those 
already considered in this section. The correlation of about .40 
between the social introversion and the other two scores is in- 
dicative that one purpose of the factor analysis—to obtain uncor- 
related subscores—was not fully realized. 

Usefulness of Self-inventories—In general, the self-inventory 
technique of evaluating social and emotional adjustment may 
usually be considered desirable and helpful, but only seldom even 
nearly sufficient. The speed, efficiency, and low cost of evaluation by 
the self-inventory method make it likely that in the hands of in- 
telligent and well-grounded teachers it may frequently yield worth- 
while results and quick insights that could otherwise be obtained 
only through far more laborious and demanding procedures. The 
reliability or consistency of the results obtained with self-inventories 
is usually far greater than that obtained with any of the other avail- 
able methods, such as rating devices, observational techniques, inter- 
views, or descriptive records. As a device for screening out of a large 
number of pupils those who are most maladjusted and in need of 
further evaluation and guidance, the self-inventory, when used with 
proper respect for its limitations, seldom does harm and frequently 
proves valuable. The validity of self-inventories is still debatable, 
because it depends not only on what is furnished in the inventory 
in the form of content, scoring techniques, and interpretative hints, 
but also upon the conditions under which they are administered. At
the present stage of their development, self-inventories cannot
function well without rapport. When there is rapport between the 
pupil and those whom he expects to examine his responses and learn 
about his “inner” life, the self-inventory may be expected to be 
valid to a considerable degree. Where rapport is absent, validity 
should not be expected. But even after rapport has been established, 
the examiner still has the responsibility of (1) selecting the best 
possible inventory for his particular conditions, problems, and 
facilities, and (2) interpreting the evidence in accordance with both 
the test author's instructions and his own common sense, psychologi- 
cal insight, and training. The teacher who has no background of 
reading and study in the fields of personality and mental hygiene 
should obtain one before attempting to use and interpret a self- 
inventory. 


TABLE 16.—SELF-INVENTORIES

Name of Inventory                     Grades          No. of   Time     Publisher
                                      Designed for    Forms    (min.)   No.

Aspects of Personality (Pintner,      4-9             1        30-35    38
  Loftus, Forlano, Alster)
Personality Inventory for             4-9             1        15       26
  Children (Brown)
Case Inventory (Maller)               5               2        15755    6
Personal Index (Loofbourow-Keys)      7-9 (boys)      1        30-40    14
Test of Personality Adjustment        4-10            1        40-50    3
  (Rogers)
Personality Inventory                 9-16 (adults)   1        25       34
  (Bernreuter)
Adjustment Inventory (Bell)           9-16 (adults)   H        25       34
Nebraska Personality Inventory        9-16 (adults)   1        25       32
  (Guilford)


RATING METHODS


This class of adjustment evaluation techniques attempts to sys- 
tematize the opinions or judgments of a pupil held by those who are 
qualified to have them. Rating methods proceed from the simple 
notion that one way of finding out about a pupil's adjustment is to
ask pertinent questions about him of persons who have had an
opportunity to live with him, to observe him in action, and to have 
his personality impinge upon their own. The rating scale merely 
organizes these questions and furnishes a means of recording re- 
sponses so as to quantify them and reduce various kinds of error 
to a minimum. 

Errors in Rating Methods.—Various errors may occur in obtaining 
an opinion about one person from another individual. In the first 
place, the person expressing the opinion may not know whereof 
he speaks. Unless he has had an opportunity to become acquainted 
with the pupil and to observe him along lines pertinent to the ques- 
tion he is answering, he cannot give valid evidence. For the rating 
device is simply a means of making explicit, organizing, and record- 
ing evaluations already made. And the evaluations must themselves 
have been based upon valid observations of some kind, preferably 
first hand, but reliable in any case. 

In the second place, the rater, like the pupil who answers questions 
on a self-inventory, must be both able and willing to give an “un- 
biased” opinion. As compared with self-inventories, rating devices 
have a better chance of creating willingness to be unbiased and 
accurate in rendering an opinion, because the opinion is not about 
oneself but about someone else. However, the rater’s ability to have 
insight into a pupil’s adjustment is, of course, less than in the case 
of the self-inventory, especially concerning those aspects of adjust- 
ment understood when we speak of the "inner" life of a person. The 
teacher or other pupils are more likely to be unbiased and frank 
about a pupil than he is likely to be about himself. But this advantage 
may be lessened in the case of pupils with whom the teacher or other 
raters are very well acquainted through either long association or 
intimate friendship. When this relationship exists, ratings have often 
been found to be too high, to show leniency and favoritism. 

A third kind of error inclusive of the second type is the "halo" 
effect in ratings. This is the effect of the rater's general opinion or 
overall impression of a pupil on his ratings of more specific aspects 
or qualities of that pupil. If a pupil is considered low in one seem- 
ingly important aspect of his personality or adjustment, there is a 
tendency to rate him lower in other aspects than he really is. The 
same tendency may prevail in the opposite direction when the pupil
is considered to rank so high in one aspect as to cast an emotional 
haze over the opinion of him in other aspects. The rating of a 
particular aspect is subject to the halo effect in so far as it is not 
easily or commonly observed or thought about, is not clearly defined, 
involves relationship to other people rather than being “self-con- 
tained,” or is highly loaded with emotional or moral significance. 
In order to prevent the halo effect from distorting and invalidating 
ratings, the rater must constantly keep in mind the necessity of con- 
sidering each trait separately, of not letting one rating be influenced 
by his other ratings of the pupil. For example, if a pupil is highly 
admired and highly rated for his honesty, his popularity must still 
be considered separately and be neither raised nor lowered by the 
honesty rating. 

Since no single rating is ever perfectly reliable, just as no single 
physical measurement is ever perfectly reliable, steps should be 
taken to increase the reliability by averaging the ratings of several 
raters. This means, of course, that several individuals should rate 
each pupil. The reliability and validity of the ratings obtained from
equally well-trained and instructed raters should increase as the
number of raters increases. But usually the increase is not worth
the trouble after about ten ratings have been averaged, except in 
cases where larger numbers of raters are easily available and a high 
reliability is desired. This point will be further discussed in Chapter 
XIX, The Evaluation of the Teacher. 
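
The diminishing return from adding raters can be illustrated with the Spearman-Brown formula, a standard result in test theory that the passage above does not itself name. The sketch below assumes equally reliable, independent raters and a purely hypothetical single-rater reliability of .55; the function name is illustrative only.

```python
def pooled_reliability(single_rater_r: float, n_raters: int) -> float:
    """Spearman-Brown estimate of the reliability of the mean of n ratings."""
    if n_raters < 1:
        raise ValueError("n_raters must be at least 1")
    return n_raters * single_rater_r / (1 + (n_raters - 1) * single_rater_r)


# Hypothetical single-rater reliability, chosen only for illustration.
SINGLE_R = 0.55

for n in (1, 2, 5, 10, 20):
    print(n, round(pooled_reliability(SINGLE_R, n), 3))
```

Under these assumptions the gain from the second rater is far larger than the gain from doubling ten raters to twenty, which agrees with the remark that averaging more than about ten ratings is seldom worth the trouble.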

Training Raters—Training and instruction are essential for 
raters, whether they be teachers or pupils, if the various kinds of 
errors are to be minimized. Apart from the directions which are 
specific for the particular rating device used, several general pro- 
cedures and precautions will tend to increase the validity of ratings. 
For some of them experimental support is lacking; they have ap- 
pealed to common sense to such an extent as to become widely ac- 
cepted. In the first place, the ratings should be made independently, 
be based upon the rater's own opinion rather than determined by 
consultation with someone else. Unless this procedure is followed, the 
ratings will be of questionable value. If another person seems to be
a valuable source of information concerning the pupil, he should 
make the ratings rather than merely serve as a consultant for 
someone who does not know the pupil as well. 


[...] that honesty always goes with religious [...] pupil [...]
about the individual. This error can be avoided partly by de-
fining traits more [...]. In addition, [...]
considering each trait independently of other traits. Not only [...]
the ratings on other traits be disregarded, as in counteracting [...]


Second to the importance of the traits in adjustment as a basis for 
including them in a rating device is their amenability to evaluation. 
Judges will agree with one another far more in rating on some 
traits than on others. These differences are not necessarily inherent 
in the traits themselves, however; they may merely reflect dif- 
ferences in the methods by which the ratings were made or the lack 
of opportunity to observe behavior indicative of the trait. A trait 
may be rated with little consistency when presented by one method 
and with much greater consistency when presented by another. The 
findings of various investigators with respect to the amenability 
of various traits to rating are as follows: 


Traits Reliably Rated                 Traits Unreliably Rated

Efficiency, originality,              Courage, unselfishness, integrity,
perseverance, quickness, judgment,    cooperativeness, cheerfulness,
clearness, energy, will,              kindness, judicial sense,
scholarship, leadership               punctuality, tact


It is perhaps apparent from these lists that the traits which mani- 
fest themselves frequently and provide numerous opportunities for 
observation are more reliably rated than those that occur rarely and 
only under exceptional conditions. Similarly, traits are better
rated when they meet the requirements discussed above in con-
nection with avoiding the halo effect. The traits should be easily
and commonly observable, rather frequently thought about, and
clearly defined.

Once the traits to be rated have been decided upon, the next step is
to provide some means of placing each person somewhere along the
range or continuum of each trait. As with product rating scales this
may be done in several ways, most of which may be grouped under
the following three headings: (1) descriptive rating scales, (2) numerical
rating scales, and (3) graphic rating scales. The descriptive
rating scale provides for each trait a list of descriptive phrases, 
usually from three to seven in number, from which the rater selects 
the one most applicable to the person being rated. In the numerical 
rating scale only numbers are assigned for each trait; when com- 
bined with the descriptive scale, the descriptive phrases are arranged 
in order of the degree, level, or intensity with which they indicate
possession or lack of the trait; then they are numbered from
0 or 1 to the highest number necessary. Ratings are assigned to the
person being rated in the form of numbers, the higher end of the
numerical scale usually indicating the desirable end of the range
for each trait.² In the graphic rating scale, the descriptive phrases
are printed horizontally at various points underneath a straight line 
about five inches long. The rater indicates the subject’s standing with 
respect to each trait by placing a check mark at an appropriate 
point along the line. Illustrations of the descriptive and numerical 
types are given below. 


Descriptive Rating Scale 


Directions: Place a check mark in the square before the phrase which 
represents your evaluation of the pupil. 
Is this pupil honest?    [ ] Honest in all situations. Never dishonest.
Can he be trusted to     [ ] Usually honest. Rarely yields to temptation.
resist temptations to    [ ] Not dependably honest.
steal, lie, or cheat?    [ ] Usually dishonest. Not to be trusted.
                         [ ] Dishonest in all situations.


Numerical Rating Scale 


Directions: Give the pupil a number from 0 to 10 to represent the
degree to which he possesses the traits listed. 0 represents none of the
trait, 5 an average amount, and 10 a maximum amount of the trait.

Is this pupil honest? Can he be trusted to resist temptations to
steal, lie, or cheat? ______


The graphic rating scale is probably the most frequently used of
these three types, since it embodies the features of both of the others.
In general, all three can be put in either the constant alternative or
the changing alternative forms described in Chapter IX. When each
trait is put in the constant alternative form, the description of each
level of it is the same for all traits and consequently must be in
general terms. An example of the constant alternative form is such
directions as the following: “Give the person who possesses none of
each trait considered a score of zero and give the person who
possesses a maximum amount of the trait in question a score of 10.
The person who possesses an average or usual amount of the trait
should be given a score of 5. Intermediate values between these
should be given appropriate numerical values, from zero to 5 for
average or below average, and from 5 to 10 for average or above
average.” Such adjectives as excellent, good, fair, poor also illustrate
the general terms which may be used for constant alternative rating
scales. In the changing alternative form separate descriptions are
provided for each level of each trait. The following illustrations of
graphic scales show how alternatives for each trait may be either
constant or changing:

² In the man-to-man rating scale the names of actual persons illustrating each level
of the trait are substituted for or added to the descriptive phrases and are assigned
numbers ranging, in steps of three, from 3 to 15.


Constant Alternatives

Is this pupil honest?
____________________________________________________________________
Extremely      Rather         Somewhat       Hardly         Not at all

Is this pupil neat?
____________________________________________________________________
Extremely      Rather         Somewhat       Hardly         Not at all


Changing Alternatives

Is this pupil honest?
_____________________________________________________________________________________
Honest in all    Usually honest.  Not dependably   Usually          Dishonest in all
situations.      Rarely yields    honest. Yields   dishonest. Not   situations.
                 to temptation.   to strong        to be trusted.   Pathologically
                                  temptation.                       deceitful.

Is this pupil neat?
_____________________________________________________________________________________
Always           Usually well     Clean and        Usually slovenly Invariably
fastidious and   groomed and      orderly about    and unkempt but  altogether
immaculate.      orderly, with    half the time.   sometimes        unkempt and
                 occasional                        spruces up.      dirty.
                 lapses.


Rules for Constructing Graphic Rating Scales.—In the course of 
the great amount of work which has been done in the development 
of graphic rating scales, the following suggestions for their con- 
struction have been made. It should be remembered, however, that 
these suggestions are for the most part based not on experimental 
findings but merely on common sense, which has frequently been 
refuted in the field of testing by experimental facts.


1. Alternation of Scale Directionality.—Most writers on the sub- 
ject suggest that the desirable end of the rating line should not 
always be on the same side of the page but should be randomly 
alternated. In this way, they believe, the halo effect may be somewhat
reduced because the rater will have to curb his tendency to check all 
traits at the same end of the scale. There is, however, some evidence 
(4 : 15-16) contrary to this belief. 

2. Explicit Trait Questions.—Each trait should be introduced by a 
question phrased so as to describe it in objective, observable, tangible 
terms. For example, rather than using the phrase, "Appearance and
manner," it is better to ask the rater, "How are you and others
affected by his appearance and manner?" To give another example, 
rather than merely labeling a trait "Emotional stability," it is better 
to follow this title with such questions as “How well poised is he 
emotionally? Is he touchy, sensitive to criticism, easily upset? Is he 
irritated or impatient when things go wrong? Or does he keep 
an even keel?" 

3. Continuous Lines.—Either a great many or no segments or 
sections should be marked off on the line. In this way the continuity 
of the trait is better emphasized and the rater can feel freer to place 
his check mark at any point along the line. In scoring the rating 
scale, a strip of paper marked off into as many divisions as is de- 
sirable can be placed along the line and the distance or number of 
divisions from the undesirable end of the line can be expressed 
numerically. 

4. Number and Position of Descriptive Phrases.—The descriptive 
adjectives or phrases, preferably either three or five in number, 
should be placed underneath the line with definite spaces between 
them. 

5. Order of Descriptive Phrases—The extreme phrases should be 
put at the very end of the line, the neutral or average phrase at the 
middle of the line, and the intermediate ones in between. 

6. Diction of the Phrases—The words used to describe each level
should be fitted to the understanding of the persons who will make 
the ratings. When children in the elementary grades are the raters, 
slang and colloquial expressions may sometimes prove advantageous. 

7. Phrasing of Extreme Levels—The phrases at the ends of the 
line should not express levels of the trait which are so rare or extreme
that raters will check them only seldom or avoid them
altogether.

8. Phrasing of Intermediate Levels.—In order to influence raters
toward using a wider range of the line the meaning of the inter- 
mediate levels should be closer to the average or neutral phrase than 
to the extreme phrases. This may help counteract the tendency for 
ratings to be concentrated around the middle of the line. 
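
The scoring procedure of rule 3, taken together with the alternated direction of rule 1, can be sketched as follows. This is only an illustration: the five-inch line comes from the description of the graphic scale earlier in the chapter, while the ten divisions, the function name, and the parameter names are assumptions made for the example.

```python
import math


def score_check_mark(position_in: float,
                     line_length_in: float = 5.0,
                     divisions: int = 10,
                     desirable_on_right: bool = True) -> int:
    """Convert a check mark's distance from the left end of the line into
    a score counted in divisions from the undesirable end."""
    if not 0.0 <= position_in <= line_length_in:
        raise ValueError("check mark lies off the line")
    # Measure from the undesirable end, whichever side it is printed on.
    if desirable_on_right:
        distance = position_in
    else:
        distance = line_length_in - position_in
    # Round to the nearest whole division (halves round upward).
    return math.floor(distance * divisions / line_length_in + 0.5)


# A mark 4 inches from the left on a scale whose desirable end is at the right,
# and a mark 1 inch from the left on a reversed scale, receive the same score.
print(score_check_mark(4.0))
print(score_check_mark(1.0, desirable_on_right=False))
```

Measuring always from the undesirable end keeps scores comparable across traits even when the printed direction of the line is alternated.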

Intra-group Rating Devices.—It is often desirable to secure ratings
of pupils from all the other pupils in their classroom or home room. 
Various techniques have been devised for this purpose. Among 
these, one of the more carefully designed is the “Guess-Who” test 
of Hartshorne and May (8). Its use is adequately described by the 
directions: 

Here are some little word-pictures of children you may know. Read
each statement carefully and see if you can guess who it is about. It
might be about yourself. There may be more than one picture for the
same person. Several boys or girls may fit one picture. Read each state-
ment. Think over your classmates, and write after each statement the
names of any boys or girls who may fit it. If the picture does not seem
to fit anyone in your class, put down no names, but go on to the next
statement. Work carefully, and use your judgment.

1. Here is the class athlete. He (or she) can play baseball, basket-
ball, and tennis, can swim as well as any and is a good sport.


In scoring the "Guess-Who" test many methods have been tried, 
but adequate results have been obtained simply by adding the num- 
ber of times a pupil's name appears, a positive value being given to 
“good” items and a negative value to “bad” ones. In practice it has 
been found that many pupils in the group are mentioned for a given 
item and that each pupil is usually mentioned for a fairly large num- 
ber of items. Thus a rating is usually derived for many pupils over 
a fairly large number of items, and the results are quite reliable. 
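
The simple scoring rule just described, one positive point for each mention on a "good" item and one negative point for each mention on a "bad" item, might be sketched as below. The data layout, the names, and the second item are hypothetical and serve only to illustrate the tally.

```python
from collections import Counter


def score_guess_who(nominations, item_values):
    """nominations: (item_number, pupil_name) pairs, one pair per mention.
    item_values: item_number -> +1 for a "good" item, -1 for a "bad" one."""
    scores = Counter()
    for item, pupil in nominations:
        scores[pupil] += item_values[item]
    return dict(scores)


# Item 1 is the "class athlete" picture quoted above; item 2 is an invented
# unfavorable picture. All names are made up.
item_values = {1: +1, 2: -1}
nominations = [(1, "Ann"), (1, "Ann"), (1, "Bob"),
               (2, "Bob"), (2, "Bob"), (2, "Cal")]
print(score_guess_who(nominations, item_values))
```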
A similar technique, a rather obvious variant of the “social 
distance” concept of Bogardus (3), has been used by Moreno (12) 
in measuring the amount of organization shown by social groups. 
His “sociometric test” requires an individual to choose his associate
for any group of which he is or might become a member. In ad-
ministering the test the pupils are instructed as follows:


You are seated now according to directions your teacher has given 
you. The neighbor who sits beside you is not chosen by you. You are
now given the opportunity to choose the boy or girl whom you would 
like to have sit on either side of you. Write down whom you would like 
first best; then, whom you would like second best. Look around and 
make up your mind. Remember that next term the friends you choose 
now may sit beside you. 


It may be assumed that the pupils who were frequently chosen by 
other pupils for such close association had a high degree of social 
acceptability in the classroom. Other pupil aspects might be de- 
termined by requiring the pupils to choose an associate for other 
purposes. Such intra-group rating of each individual in a group 
by all the others can, when properly administered, throw valuable 
light on the social organization of classrooms, study groups, and 
other social units. 
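
A rough tally of such choices could be sketched as follows. The decision to count first and second choices equally is an assumption made here for simplicity, not a procedure given in the text, and the pupils' names are invented.

```python
from collections import Counter


def tally_choices(choices):
    """choices: chooser -> (first_choice, second_choice).
    Returns the number of times each pupil was chosen, including zero."""
    received = Counter({pupil: 0 for pupil in choices})
    for chooser, picks in choices.items():
        for picked in picks:
            if picked != chooser:  # a pupil cannot choose himself
                received[picked] += 1
    return received


# Hypothetical class of four pupils.
choices = {
    "Ann": ("Bob", "Cal"),
    "Bob": ("Ann", "Cal"),
    "Cal": ("Ann", "Bob"),
    "Dee": ("Ann", "Bob"),
}
for pupil, count in tally_choices(choices).most_common():
    print(pupil, count)
```

Pupils with a count of zero are the ones no classmate chose, and are often of as much interest to the teacher as the most frequently chosen.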

Available Rating Devices.—We are now ready to examine some 
of the commercially available rating devices designed to provide 
practical assistance in the evaluation of social and emotional ad- 
justment. Our descriptions will deal with the traits for which the 
device is designed, the technique used in administering it, the method 
by which the various levels of each trait are presented, and the
validity and reliability of the device. Here again the discussion will 
follow the age or grade level of the pupils for whom the rating 


are such aspects of behavior as cheating, lying, temper outburst, 
speech difficulties, imaginative lying, sex offenses, truancy, and de- 
fiance to discipline. The total score for this schedule is determined 


light of the seriousness of the given problem. The “seriousness” was
determined from clinical records for a group of pupils upon whom 
the schedule was standardized; of course, this may not accord with 


372 How to Evaluate 


the principle that various kinds of behavior may differ in serious- 
ness for different individuals. The 35 traits in Schedule B are 
classified into four groups: intellectual, physical, social, and emo- 
tional. Here again the level of each trait is weighted not in the 
usual fashion where the extreme ends invariably receive the highest 
or lowest score, but rather in terms of the average scores on Schedule 
A for a standardization group of pupils who were rated for each of 
the various levels of each item in Schedule B. For example, the 
following ratings to the question, “Is he abstracted or wide awake?” 
are weighted in the order 5, 4, 2, 1, 3 rather than the usual 5, 4,
3, 2, 1:


2. Is he abstracted or wide awake?                             Score ____

Continually    Frequently     Usually        Wide           Keenly
absorbed in    becomes        present-       awake.         alive and
himself.       abstracted.    minded.                       alert.
(5)            (4)            (2)            (1)            (3)
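
The difference between this weighting and ordinary positional scoring can be made concrete with a small lookup. The weights for the item are those printed above; the function and constant names are hypothetical, and the sketch does not reproduce the publisher's scoring key for any other item.

```python
# Weights for the five levels of "Is he abstracted or wide awake?",
# in left-to-right order, as printed above.
ABSTRACTED_WEIGHTS = (5, 4, 2, 1, 3)
# Conventional weights that depend only on position, for comparison.
POSITIONAL_WEIGHTS = (5, 4, 3, 2, 1)


def item_score(level_index: int, weights=ABSTRACTED_WEIGHTS) -> int:
    """Return the weight of the level checked (0 is the leftmost phrase)."""
    if not 0 <= level_index < len(weights):
        raise ValueError("no such level on this item")
    return weights[level_index]


# A pupil rated "Keenly alive and alert" (rightmost level, index 4).
print(item_score(4))                      # weight used by the schedule
print(item_score(4, POSITIONAL_WEIGHTS))  # weight under positional scoring
```

The rightmost level earns 3 rather than 1 because, in the standardization group, pupils rated at that extreme averaged more problem behavior than pupils rated simply "wide awake."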


The total score on Schedule B was found (14) to have a reliability 
of .86 in terms of reratings by five teachers of the same 182 primary 
grade pupils after a short interval of time, and (14 : 29) a reliability
of .92 in terms of the corrected correlation between split halves. 
When various pairs of junior school teachers rated the same small 
groups of children, they agreed to the extent of a correlation co- 
efficient of about .60. Evidence of validity is the correlation of .76 
between the total score on both schedules and the frequency with 
which pupils were sent to their school principals for disciplinary 
purposes. The ratings were also found to discriminate well be- 
tween pupils referred to a child guidance clinic and “normal” pupils. 
The manual accompanying the schedules contains adequate in- 
structions for their use and interpretation, especially concerning pre- 
cautions in applying the norms to all groups of pupils and the need 
for supplementary data before decisive interpretations are made. 
The Winnetka Scale for Rating School Behavior and Attitudes, 
based on thousands of observations and records of classroom be- 
havior incidents, consists of thirteen classroom situations each pro- 
vided with five or more specific levels of behavior. A typical situation
is “When there is a group project to be carried out,” and a typical 
phrasing of a level of this trait is “Withdraws from group and 
carries on non-valuable activities.” The scores for each level, ranging
from 0 to 10, are interpretable in relation to the scores of approximately
1200 pupils. The thirteen items are grouped into the following
five categories on the basis of a factor analysis: cooperation,
social consciousness, emotional control, leadership, and responsibility.
The form is designed for obtaining ratings over a two-year period 
and for combining the ratings at the end of each year into a com- 
posite rating on each of the five traits. The reliability, based on re- 
ratings by teachers after two to eight weeks, was .87 for the complete 
scale and from .72 to .82 for the five categories. Evidence of validity 
was found in the correlation of .71 with the Haggerty-Olson-Wickman
Behavior Rating Schedules.

The American Council on Education Personality Rating Scale 
is available in two revisions, A and B, the first of which is a graphic 
scale and the second a descriptive scale for five traits: appearance and 
manner, industry, ability to control others, emotional control, and 
distribution of time and energy. The directions emphasize the im- 
portance of observations of the student as the basis for ratings. Space 
is provided in each of the traits for checking “no opportunity to 
observe.” There is also ample space, together with illustrations, for 
recording “instances that support your judgment.” The scale has 
been carefully designed and studied with a view to its usefulness for 
college entrance officers. A reliability coefficient of .77 was obtained 
by correlating the average of three raters of 107 freshmen against 
the average of three other raters of this group. 
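
A reliability coefficient of this kind, the correlation between the averaged ratings of two independent panels, can be computed along the following lines. The ratings shown are invented, and the function names are illustrative; only the general procedure of averaging each panel and then correlating the averages is taken from the text.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


def panel_reliability(panel_a, panel_b):
    """panel_a, panel_b: one list of ratings per student from each panel.
    Correlates the panel averages, student by student."""
    return pearson_r([mean(r) for r in panel_a],
                     [mean(r) for r in panel_b])


# Hypothetical ratings of four students by two panels of three raters each.
panel_a = [[3, 4, 4], [1, 2, 2], [5, 5, 4], [2, 3, 3]]
panel_b = [[4, 4, 3], [2, 1, 2], [4, 5, 5], [3, 3, 3]]
print(round(panel_reliability(panel_a, panel_b), 2))
```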

The Vineland Social Maturity Scale consists of 117 items arranged 
in the order in which normal individuals ranging from 1 month to 
25 years of age manifest various aspects of social maturity. More 
a behavior description form than a rating scale, it requires that the 
examiner base his judgment upon a detailed interview either with 
an adult well acquainted with the subject or with the subject him- 
self. It is similar to an individual intelligence test, especially in the 
training requirements for the examiner. Typical items are the following:
At age 0-1, balances head, grasps objects within reach;
age 7-8, tells time to quarter of an hour, disavows a literal Santa 
Claus; age 25, systematizes own work, shares community responsibilities.
The various kinds of scores, including a social quotient,
which is the subject’s social age divided by his chronological age, 
provide easily understood measures of social adjustment. The 117 
items are placed in eight categories: self-help general, self-help eat- 
ing, self-help dressing, self-direction, occupation, communication, 
locomotion, and socialization. The reliability of the scale, based on 
retests of 125 individuals of various ages after intervals of from 1 day 
to 9 months, was found to be .92. 
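
The social quotient described above is a simple ratio, which might be computed as in the sketch below. Multiplying by 100, by analogy with the intelligence quotient, is an assumption not stated in the passage; the function name and the example ages are likewise hypothetical.

```python
def social_quotient(social_age: float, chronological_age: float,
                    scale: float = 100.0) -> float:
    """Social age divided by chronological age (both in the same units).
    The factor of 100 is an assumed convention; pass scale=1 for the bare ratio."""
    if chronological_age <= 0:
        raise ValueError("chronological age must be positive")
    return scale * social_age / chronological_age


# A hypothetical child of 10 years whose social age is 8 years.
print(social_quotient(8, 10))
```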

The B E C Personality Rating Schedule prepared by the Business 
Education Council provides graphic rating scales for 29 aspects of 
personality grouped under eight principal scales: mental alertness, 
initiative, dependability, cooperativeness, judgment, personal im- 
pression, courtesy, and health. Such aspects as mental alertness and 
physical health, although better evaluated by other techniques, do 
not detract substantially from the value of the scale, for the teacher's 
impressions of these aspects may be of value in themselves. The 
descriptive phrases for the levels of each trait are carefully worded 
in terms of observable behavior rather than qualities. Although the 
directions for using the scale emphasize the need for training and 
retraining the raters, no space is provided for indicating "no op- 
portunity to observe" or instances of behavior which support the 
ratings given. 

Usefulness of Rating Methods.—As in the case of self-inventories, 
rating devices can furnish valuable but seldom sufficient informa- 
tion concerning the emotional and social adjustment of pupils. 
Whereas the most essential factor in the use of self-inventories is 
rapport, the crucial factor in the use of rating devices is the training, 
sincerity, enthusiasm, and observational powers of the teachers or 
other persons who render the judgments. When the raters are ade- 
quate in these respects and when several ratings are averaged for 
each pupil, the data obtained by means of a well-designed rating 
device may well prove as reliable and valid as those obtained by the 
other means at present available. A program of ratings should not 
be entered upon by a teacher or school in a casual manner. Rather it 
should be preceded by careful training discussions of the importance 
of the rating to be made, of the nature of the device to be used, and 
of the precautions necessary to avoid the various kinds of error dis- 
cussed above. More than with most other kinds of evaluation, a
program for rating pupils depends on democratic motivation. The 
raters’ hesitancies and objections with regard to the rating scheme 
must be respected. Only after these objections have been eliminated 
should training and instruction in the use of the rating device 
proceed. 


TABLE 17.—RATING DEVICES

Name of Inventory                            Range               No. of   Publisher No.
                                                                 Forms    (see list on
                                                                          pp. 561-562)

Behavior Rating Schedule (Haggerty-Olson-    Kdg.-12             1        38
  Wickman)                                   4
Winnetka Scale for Rating School Behavior    Nursery school-6    1        37
  and Attitudes (Van Alstyne)
American Council on Education Personality    9-13                2        2
  Rating Scale
Vineland Social Maturity Scale (Doll)        Infancy through     1        17
                                             adulthood
BEC Personality Rating Schedule              7-16, adults        1        18
Behavior Description Form                    7-12                1        15

TESTS OF CONDUCT, KNOWLEDGE, AND JUDGMENT


Many ingenious techniques have been devised for the measure- 
ment of pupils’ actual conduct or behavior as related to their 
honesty, suggestibility, persistence, cooperation or service, inhibition, 
caution, speed of decision, aggressiveness, and studiousness or effort 
in school. Similarly a wide variety of ingenious techniques has been 
designed for the measurement of their knowledge and judgment 
of what is desirable and undesirable, good and bad, probable and 
improbable as a consequence, serious and trivial, in the moral and
ethical problems of everyday life. Symonds (28) has presented a 
valuable summary of the work in this area of evaluation, and the 
reader is referred to his book for a more detailed treatment than is 
given here. We shall present merely some of the findings of in- 
vestigators in this field which indicate that these measures provide 
relatively little of practical value to the classroom teacher in the 
evaluation of emotional and social adjustment. 

In general, it has been found possible to design tests of this type 
which possess a satisfactory degree of reliability or self-consistency.
On the other hand their validity for practical purposes has proved 
to be disappointingly small. In tests of actual conduct or behavior 
the major drawback is the high degree of specificity which they 
exhibit. Even so narrow an aspect of conduct as honesty in the 
classroom situation has been found to be highly specific, so that the 
correlation between measures of classroom honesty obtained by tests 
of copying and tests requiring the pupil to score his own paper 
honestly was quite low. When the test of honesty differs to only a 
slightly greater degree, such as tests of cheating in relationship to 
tests of stealing or lying, the correlations practically disappear. 

Since the performance test for each trait, such as honesty, per- 
sistence, or aggressiveness, did not give the same results when meas- 
ured by techniques which, on their face value, apparently approached 
the same aspect of personality from different directions, it seemed
necessary to conclude that the existence of such a general trait of 
personality was doubtful and that each type of behavior was a 
separate trait in itself, unrelated to other traits. Consequently, the 
usefulness of such tests for the prediction and diagnosis of any 
aspect of adjustment sufficiently broad to be of practical value was 
almost negligible. 

The findings of Hartshorne and May, on which most of these
conclusions concerning the specificity of moral habits are based, 
have been criticized by Murphy and Likert (13) on the ground, 
first, that the number of tests used to measure each character trait 
was too small and, second, that Hartshorne and May themselves 
found it necessary in grouping their tests to assume that a general 
honesty trait exists in the psychological sense. Murphy and Likert 
argue that a larger number of similar tests (the largest number 
used by Hartshorne and May in their honesty test was nine) more 
similar in nature would have been more successful in revealing a 
general tendency to behave honestly.

Whatever the theoretical significance of this argument may be, 
the fact remains that for practical purposes the evaluation of ad- 
justment or character by direct tests of conduct or behavior with the 
techniques thus far devised is too cumbersome. Each of the nine
honesty tests found by Hartshorne and May to be unsuccessful in 
measuring a general honesty trait involved as much cost, time, and 
general effort as is required by a single classroom achievement test.
To multiply such tests to the number required for reliable measure- 
ments of honesty or any other character trait by the conduct tests at 
present available would probably place an unbearable burden upon 
the resources of most schools. 

The major shortcoming of tests of knowledge and judgment in 
moral and ethical matters is the fact that their correlations with 
conduct are very low. That is, they have been found invalid for 
the purpose of differentiating between normal and delinquent in- 
dividuals or between cheaters and non-cheaters (28 : 289). Of one 
aspect of adjustment, however, these tests do provide worth while 
measurements, namely, the degree to which an individual has be- 
come aware of the moral code of the group in which he lives. The 
tests at least indicate the degree to which a pupil knows what is 
right even if they do not indicate whether he will do what is con- 
sidered right, decent, altruistic, and praiseworthy by the culture 
in which he lives. Whether this knowledge is concerned in any way 
with a pupil’s emotional and social adjustment, his happiness and 
social acceptability, must yet be shown. It is possible that these tests 
may reveal cases of social maladjustment when they are caused by 
the pupil’s failure to “talk the same language” with respect to 
moral and ethical questions as his fellow pupils do. 


ANECDOTAL BEHAVIOR RECORDS


The inadequacies of self-inventories, rating devices, moral knowledge
and judgment tests, and performance tests as methods for
the evaluation of social and emotional adjustment have led to the 
development of a more direct observational approach known as the 
anecdotal behavior record. This record is a collection of brief re- 
ports of incidents in the life of a pupil which seem to be significant 
as a revelation of his emotional and social adjustment. Probably 
the first mention of anecdotal records in the present context was 
made in the report of the Committee on Personnel Methods of the 
American Council on Education in 1928 (17). Since that time 
numerous studies and reports of experience with them have become 
available, most of which have been summarized by Traxler (29) 
and by Jarvie and Ellingson (10). The present discussion will be
drawn largely from these two books; the reader is referred to them 
for a fuller treatment than is possible here. 


Since anecdotes are reports of observations of what a pupil says 
or does in any of the numerous classroom, recreational, social, or 
intellectual situations in which pupils may be observed, the question 
naturally arises, which incidents or behaviors are worth being 
recorded and which are not significant as anecdotal records? The 
answer can come only from the application of some general concept 
of emotional and social adjustment in specific situations and from 
what is significant for any particular individual. Hard and fast 
rules or criteria for the selection of significant anecdotes are im- 
possible. Every teacher must formulate his own criteria of sig- 
nificance in terms of a general concept of adjustment and of a 
knowledge of the individual pupil, which must be based on past 
experience with him or on a cumulative record of evaluations of 
him in those areas or aspects with which these chapters are con- 
cerned. 

Anecdotes should include both objective description of behavior 
and interpretation of it. Traxler (29 : 4) has furnished excellent illustrations
of objective reports, subjective interpretations, and the
recommendations occasionally to be desired. The first of the follow- 
ing paragraphs is an anecdotal record which combines report and 
interpretation in a confusing fashion. The second illustrates the 
report alone; the third the interpretation alone; and the fourth the 
recommendation. 


Combined report, interpretation, and recommendation.—In a meeting
of her club today, Alice showed her jealousy of the new president by
firing questions at her whenever there was an opportunity. She tried to
create difficulties by constant interruption throughout the period. The
other students showed their resentment by calling for her to sit down. It
is apparent that she is a natural trouble-maker, and I think her counselor
should have her in for a serious talk.

Incident.—In a meeting of her club today Alice fired questions at the
new president at every opportunity. She interrupted many times during
the period. On several occasions the other students called for her to sit
down.

Interpretation.—Alice seemed to be jealous of the new president and
desirous of creating difficulty. The other students appeared to resent
her action. The girl seemed to enjoy making trouble for others.

Recommendation.—It would be advisable for the counselor to lead the
girl tactfully into a discussion of her relations with the other students in
an effort to bring about better adjustment.




The separation of subjective and objective material in the anecdote 
enables other teachers or counselors to understand it more accurately. 
Even when there is no separation, however, the anecdotal method 
can provide valuable data. 

Not only undesirable or problem behavior should be recorded but 
also desirable, inconspicuous, or even neutral but significant mani- 
festations of the ways in which a pupil is adjusting emotionally and 
socially to his problems. A rough indication of the general trend of 
his adjustment may be obtained if desirable anecdotes concerning 
him are designated in some way, such as a plus mark, and unde- 
sirable behaviors with some other designation. The amount and 
direction of the difference between the number of desirable and 
undesirable anecdotes over an interval of time might provide some 
roughly quantitative measure of the general trend of the pupil's 
behavior. 
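The plus-and-minus tally just described can be illustrated with a short computation. The following is only a minimal sketch: the dates and marks are invented for illustration, the grouping by calendar month is an assumption (the text says only "an interval of time"), and the name `net_trend` is hypothetical.

```python
from collections import Counter
from datetime import date

# Hypothetical marks for one pupil: (date observed, "+" desirable / "-" undesirable).
marks = [
    (date(1942, 10, 5), "-"),
    (date(1942, 10, 19), "-"),
    (date(1942, 11, 2), "+"),
    (date(1942, 11, 16), "-"),
    (date(1942, 12, 1), "+"),
    (date(1942, 12, 14), "+"),
]


def net_trend(marks):
    """Return {(year, month): desirable count minus undesirable count}."""
    net = Counter()
    for day, mark in marks:
        # Assumption: any mark other than "+" is counted as undesirable.
        net[(day.year, day.month)] += 1 if mark == "+" else -1
    return dict(net)


trend = net_trend(marks)
for period in sorted(trend):
    print(period, trend[period])
```

A run of differences that moves from negative toward positive would suggest improving adjustment, but such a count is no more than the rough indication the text describes and does not replace reading the anecdotes themselves.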

A crucial requirement of any anecdotal technique is that it be 
cumulative. Many anecdotes should be collected for each pupil. 
Furthermore, as many different observers as possible should record 
anecdotes for each pupil in order to insure that the sampling of his 
behavior will be sufficiently broad to provide evidence that can be 
reliably interpreted. This means, of course, that the value of
anecdotal records as evidence of adjustment is, other things being 
equal, highly related to the sheer quantity of evidence recorded in 
terms of both the number of incidents and the number of different 
points of view represented in the accumulation of records. But 
even where only one teacher is in a position to observe a given 
pupil, his anecdotes will constitute valuable evidence concerning 
adjustment, especially if it is accumulated, summarized, interpreted, 
and made available to the pupil's future teachers or counselors. 

Traxler (29) has presented the general procedure for setting up a 
system for anecdotal records in six steps: (1) enlisting cooperation, 
(2) deciding how much should be expected of observers, (3) prepar- 
ing forms, (4) obtaining the original records, (5) central filing, and 
(6) summarizing. The cooperation of teachers should be based 
upon not only a willingness to experiment but a broad interest in the 
pupil and a desire for individualized education. The faculty of a 
school will seldom succeed with so broad an undertaking as an 
anecdotal record system unless it possesses these qualifications. 

Neither too much nor too little should be expected of the individual
teacher; it should be agreed that every teacher will record
no fewer than a certain reasonable number of anecdotes each 
week. What this number should be may probably be best deter- 
mined by having a trial week during which each teacher records 
as many anecdotes as possible. A meeting of the teachers can then 
be held to decide on the minimum for everyone and the first week's 
production can be criticized. This minimum will, of course, vary 
inversely with each one's teaching load. If possible, one anecdote a 
week for each pupil should be established as the minimum, but 
this may be increased or decreased to prevent any teacher from 
carrying an impossible load. At this point emphasis should be placed 
upon the variety of behaviors which are worth recording; that is, 
not only striking behavior but even such inconspicuous behavior 
as complete inactivity or staring into space may be significant of 
pupil adjustment or maladjustment. 

The form for the records may be very simple, such as a small 
card or a half sheet of paper that contains space for the pupil's and 
teacher's names, date, place, and anecdote. These original records 
may be organized and summarized periodically for each pupil on a 
similar but somewhat larger sheet, or they may be filed together if 
filing space permits. Still broader summaries may be prepared at the 
end of each year in an annual summary of each pupil's anecdotes. 
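For a school that keeps its records in a file rather than on cards, the form described above might be represented as follows. This is a sketch under stated assumptions: the field names are hypothetical, the sample values are invented, and the separate interpretation and recommendation fields reflect the separation of objective and subjective material recommended earlier in this chapter rather than anything prescribed for the card itself.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class AnecdoteRecord:
    """One anecdote, with the fields the text lists for the record form."""
    pupil: str
    teacher: str
    observed_on: date
    place: str
    incident: str                         # objective description of behavior
    interpretation: Optional[str] = None  # observer's comment, kept separate
    recommendation: Optional[str] = None  # optional suggestion for guidance


# Invented example values, for illustration only.
record = AnecdoteRecord(
    pupil="Alice",
    teacher="Club sponsor",
    observed_on=date(1942, 10, 5),
    place="Club meeting",
    incident="Fired questions at the new president at every opportunity.",
    interpretation="Seemed to be jealous of the new president.",
)
print(record)
```

Leaving the interpretation and recommendation optional allows an observer to file the incident alone when he is not prepared to interpret it.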

Obtaining the original record is a matter of taking "mental" 
notes at the time an incident occurs and from these notes writing 
the anecdote during a free period later in the day so as not to inter- 
fere with whatever activity is claiming the teacher's or pupil's at- 
tention at the time. Periodic checks should be made to insure an 
even distribution of anecdotes for all pupils, especially to prevent 
inconspicuous pupils from going unobserved. By the end of the 
semester the teacher should have several anecdotes for each pupil 
whom he has had an opportunity to observe. 
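The periodic check for even distribution amounts to counting the anecdotes filed for each pupil on the class roll and noting those who fall short. The sketch below assumes a simple list of pupil names, one entry per anecdote filed; the function name `under_observed` and the minimum of two are illustrative assumptions, not figures from the sources cited.

```python
from collections import Counter


def under_observed(roster, anecdote_pupils, minimum):
    """Return, in alphabetical order, pupils with fewer than `minimum` anecdotes."""
    counts = Counter(anecdote_pupils)
    # Pupils with no anecdotes at all are counted as zero by Counter.
    return sorted(pupil for pupil in roster if counts[pupil] < minimum)


# Invented class roll and filing list, for illustration only.
roster = ["Alice", "Henry", "Jane", "John"]
filed = ["Alice", "Alice", "John", "Alice", "Jane", "John"]

print(under_observed(roster, filed, minimum=2))
```

Checking against the full class roll, rather than against the anecdotes alone, is what brings the wholly unobserved pupil to notice.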

Recording anecdotes may easily take so much time as to cause 
the whole system to break down. The time burden may be reduced 
through some of the procedures suggested by Jarvie and Ellingson 
(10:61): (1) Providing centrally located dictaphones for use in 
recording anecdotes. Each teacher can be allotted certain times each 
week for using the dictaphone. (2) Assigning secretaries to meet
teachers at specified times to take down anecdotes and transcribe
them for the central file. (3) Organizing weekly discussions of pupils
by teachers and providing secretarial help to record the anecdotes 
brought forth. 

The central filing of anecdotes should bring together for summary
and comparison the observations of each pupil over a period of time. 
Carbon copies may be retained by the individual teachers for their 
own files. The central filing place may be either the office of the 
pupil’s counselor, the pupil's home room, the principal's office, or 
wherever else they can be most easily used for guidance purposes. 
Anecdotes should probably be centrally filed once a week, but local 
conditions should be the determining factor. 

The summarizing of anecdotes requires careful planning to prevent
the abundance of records from causing the whole program to
break down. If about fifty anecdotes for each pupil are collected 
during the school year in a well-established system, the task of 
examining this material, extracting meaningful patterns and trends 
of adjustment from it, and condensing it into a significant short 
summary easily becomes formidable. Periodic summaries clear the 
way for the annual summary, and they may be entrusted to clerical 
workers if they are merely chronological arrangements of briefed 
statements. The annual summary should, however, be made only by 
the teacher, counselor, principal, or school psychologist responsible 
for the pupil’s guidance. 

The accurate observation and recording of behavior incidents is 
the basis of the anecdotal record technique; without this it may 
easily do more harm than good. Practice in separating objective 
description from interpretative comment while refining the useful- 
ness of both to the utmost comes with the continued use of the 
system and usually results in continuous improvement. Precautions 
should be taken to prevent anecdotes from becoming defenses of 
teachers against pupils or mere expressions of emotional reactions 
to pupil behavior. Probably a wise rule is that no anecdote should 
be written by teachers about incidents to which they have reacted 
emotionally. Similarly, the social situations or context in which in- 
cidents occur may sometimes be essential to their understanding and 
should then be described along with the behavior itself. 

Furthermore, interpretations should be made only with due regard 
to the sufficiency of the number of anecdotes and with full realization
that whatever part of a pupil's behavior finds its way into anecdotal
records is only a small sample of his total significant behavior.
Like all other data concerning pupil adjustment, these records 
should be considered confidential in order to prevent irresponsible 
and unqualified persons from using and interpreting them to the 
disadvantage of the pupil. Even more than the annual summary, 
the specific anecdotes themselves should be treated in general terms. 
Interpretations and recommendations based on the anecdotal record 
should be formulated in the light of the fact that adjustment is a 
long-time process in which it may be dangerous to attempt short
cuts.

Among the values claimed for anecdotal records are the variety 
and validity of the evidence they provide concerning adjustment, 
their specificity and exactness, their power to direct the attention of 
teachers to pupils as individuals, the continuity of the evidence they 
provide, their excellence as a basis for the teacher’s use of various 
rating devices, and their validity as criteria for the validation of 
other adjustment evaluation devices. 

In relation to other methods of adjustment evaluation such as 
self-inventories and rating devices, the role of anecdotal records 
may be both a supplement to and particularly a source of evidence 
for the use of rating scales. The space provided in such scales as 
that of the American Council on Education for “instances which 
support your judgment” is especially well used when filled out 
from the evidence contained in anecdotal records. The annual sum- 
mary of these records may well take the form of various rating 
scales. 

In an attempt to provide a mechanism for such summaries and to 
reduce somewhat the labor involved in anecdotal record systems, the 
Reports and Records Committee of the Eight-Year Study of the 
Progressive Education Association has devised a Behavior Descrip- 
tion form. This consists of a filing and transfer folder upon which 
are printed descriptions of five or six different categories for each 
of the following traits: responsibility-dependability, creativeness and
imagination, influence, inquiring mind, open-mindedness, power and
habit of analysis, social concern, emotional responsiveness, seriousness
of purpose, social adjustability, and work habits. Each category
of each trait is carefully described, as is apparent from Fig. 8.


This report describes the characteristic behavior of the student in a
number of important areas. It should not be interpreted as a rating.
Instead one should read the descriptions and attempt to get from them an
understanding of the person described, and of his fitness for particular
opportunities and undertakings.

Directions: (1) In general the initials of subject or activity fields are
used in the recording in order to identify the relations between the
observers and the student. A complete key is given at the top of the
folded over sheet. (2) [illegible] left to right, being chronological,
show the [illegible] or continuity in behavior during the period covered
by the record. (3) The agreements in description may show a student's
most common behavior. They may not be more important than an isolated
judgment which often has great significance because of a better basis for
judgment or because it indicates a response to some particular condition,
field, or personality.

RESPONSIBILITY-DEPENDABILITY (with columns for Grades 7 through 12)

Responsible and resourceful: Carries through whatever is undertaken,
and also shows initiative and versatility in accomplishing and
enlarging upon undertakings.

Conscientious: Completes without external compulsion whatever is
assigned but is unlikely to enlarge the scope of assignments.

Generally dependable: Usually carries through undertakings, self-assumed
or assigned by others, requiring only occasional reminder or compulsion.

Selectively dependable: Shows high persistence in undertakings in
which there is particular interest, but is less likely to carry through
other assignments.

Unreliable: Can be relied upon to complete undertakings only when they
are of moderate duration or difficulty and then only with much
prodding and supervision.

Irresponsible: Cannot be relied upon to complete any undertakings even
when constantly prodded and guided.

Fig. 8. Excerpt from Behavior Description Form.




The manual which is provided supplements the description of 
each category with further discussion and illustration and provides 
detailed directions for use. In organizing a program of anecdotal
records in a school, it might be desirable to instruct teachers in
the meaning of the traits and categories developed for the Behavior
Description plan. These might then be used as points of view or
concepts around which the observation of pupils might be organized.

The following are sample anecdotes used by Traxler (29 : 25) 
and by Jarvie and Ellingson (10 : 30) to illustrate typical products
of the anecdotal record technique. 


Oct. 4. John came to my room and studied during the lunch hour 
today. When I asked him if he wasn't going to lunch, he said that he 
had to prepare his history lesson for the next hour, but I learned after- 
ward that he has had no money for lunch for several days. 

Nov. 15. Incident.—Each Wednesday at 2:30 Jane comes to me for 
special help in reading. Today was warm and the door to the corridor 
was open. Several times during the hour, persons walked along the 
corridor past the door. Every time this happened, Jane stopped reading 
and looked to see who it was, even though she was not facing the door. 

Comment.—This behavior is typical of this pupil. She seems to have 
a very short span of attention and is apparently unable to keep her mind 
on anything very long. Undoubtedly this partly explains her reading 
difficulty, although it is also a symptom, since it shows that she has no 
interest in reading (29 : 25). 


1. Objective statement of behavior with instructor's personal interpreta- 
tion of that behavior in view of the situation in which it occurred: 

Illustration.—Rubbed eyes constantly today; seemed tired and considerably
irritated when asked to contribute to class discussion. This is
unusual, since he ordinarily takes a very active part in the work of the 
group. I am not sure whether this is the result of lack of rest or a case of 
eyestrain which should be referred to the medical office. 

2. Objective statement of behavior with interpretation and record of 
treatment: 

Illustration.—In conversations with Henry he stated that he felt that
he was not doing himself justice because he did not know how to take 
part in discussion effectively. I am sure that he is interested in his work 
and is in the habit of following the trend of thought in the discussion, 
but needs some help in making an attack on the problem. I advised him
to come to class each day well prepared on one or two points and told
him that I would help him bring out these points in each day's dis- 
cussion (10 : 30). 


SUMMARY


Guidance requires evaluation of the emotional and social adjust- 
ment of pupils. Among the difficulties of such evaluation are those 
of obtaining frank and insightful responses, determining the dimen- 
sions of adjustment, individualizing interpretation, and securing in- 
dependent criteria for validation. The uses, limitations, and avail- 
ability of four major classes of evaluation techniques are considered 
in detail. Self-inventories, essentially devices for self-rating, require
good rapport between teacher or counselor and pupil unless in 
highly disguised form. Rating methods for the systematic recording 
of opinions or judgments concerning people are subject to various 
kinds of error which may be reduced by training raters and con- 
structing the devices according to certain suggestions. Conduct tests 
are typically too cumbersome for practical use in adjustment evalua- 
tion because their specificity makes a large number necessary for 
reliable measurement. Tests of knowledge and judgment in ethical
problems similarly lack practical value as a rule because of their 
low correlations with ethical performance. Anecdotal records and 
behavior descriptions provide overall techniques for evaluating 
emotional and social adjustment. The steps in initiating and carry- 
ing out an anecdotal record program are described and solutions to 
various administrative problems are suggested. 


QUESTIONS 


1. Examine a representative self-inventory of the type discussed in this 
chapter. Select those items to which pupils would probably not re- 
spond frankly unless a high degree of rapport existed between them 
and the person who was evaluating adjustment. Select those items 
to which they might not respond in valid fashion because of lack 
of insight into the truth about themselves. Defend your selections. 

2. Construct a rating device for several aspects of college students. 
Have one or more students rated on this device by several of your 
classmates. Note agreements and disagreements between the ratings. 
How is the magnitude of the disagreements affected when the
ratings are grouped by 2's, 3's, 5's, and so on, and the average ratings
of such groups are compared? What factors affect the disagreements 
between ratings given one student by two or more raters? What are 
the implications of your conclusions on techniques of evaluating 
adjustment through ratings? 

3. What methods can you devise to determine whether the training of 
raters has had the desired effect in reducing errors in rating? 

4. Compose a list of situations or performances in which some trait, 
such as tactfulness or generosity, would be manifested. Would you 
expect to be able to predict behavior in some or all of these situations 
from a pupil's behavior in others? What are the implications for the 
evaluation of social adjustment through performance tests? 

5. List the independent or external criteria which you can devise for 
a given self-inventory. Arrange them in order of their own validity 
and defend your arrangement. Do the same for the applicability or 
practicability of these criteria. 

6. Behavior anecdotes may be biased both at the time the behavior is
observed and at the time the anecdote is recorded or written. Discuss
ways of eliminating or reducing the errors due to these sources
of bias.


REFERENCES 


1. Bennett, G. K., “A simplified scoring method for the Bernreuter 
Personality Inventory," Journal of Applied Psychology, 22: 390- 
394 (1938). 

2. Bernreuter, R. G., "The present status of personality tests," Educa- 
tional Record, Supplement No. 13, pp. 160-171 (1940). 

3. Bogardus, E. S., "Measuring social distance," Journal of Applied 
Sociology, 9 : 299-308 (1925). 

4. Brandenburg, G. C., and Remmers, H. H., Manual for the Purdue 
Rating Scale for Instructors, Lafayette, Ind.: Lafayette Printing Co., 
1928. 

5. Flanagan, J. C., Factor Analysis in the Study of Personality, Stan-
ford University: Stanford University Press, 1935. 

6. Flanagan, J. C., “Technical aspects of multi-trait tests,” Journal of 
Educational Psychology, 26 : 641-651 (1935). 

7. Guilford, J. P. and R. B., “Personality factors S, E, and M, and 
their measurement," Journal of Psychology, 2 : 109-127 (1936). 

8. Hartshorne, H., May, M. A., and Maller, J. B., Studies in Service and 
Self-Control, New York: The Macmillan Company, 1929. 


9. Hathaway, S. R., “The Personality Inventory as an aid in the
diagnosis of psychopathic inferiors,” Journal of Consulting Psychology,
3 : 112-117 (1939).

10. Jarvie, L. L., and Ellingson, Mark, A Handbook on the Anecdotal
Behavior Journal, Chicago: University of Chicago Press, 1940.

11. Jarvie, L. L., and Johns, A. A., “Does the Bernreuter Inventory
contribute to counseling?” Educational Research Bulletin, 17 : 7-9
(1938).

12. Moreno, J. L., Who Shall Survive? A New Approach to the Problem
of Human Inter-Relations, Washington: Nervous and Mental
Disease Publishing Company, 1934.

13. Murphy, G., and Likert, R., Public Opinion and the Individual,
New York: Harper & Brothers, 1938.

14. Olson, W. C., Problem Tendencies in Children, a Method for Their
Measurement and Description, Minneapolis: University of Minnesota
Press, 1930.

15. Peterson, Ruth A., “The validity of the Bell Adjustment Inventory
when applied to college women,” Journal of Psychology, 9 : 227-236
(1940).

16. Pintner, R., and Forlano, G., “Validation of personality tests by
outstanding characteristics of pupils,” Journal of Educational Psychology,
30 : 25-32 (1939).

17. Robertson, C. A. (chairman), Subcommittee on Personality Measurements,
“Personnel methods,” Educational Record, Supplement
No. 8, pp. 53-68 (1928).

18. Rogers, C. R., Manual of Directions, a Test of Personality Adjustment,
New York: Association Press, 1931.

19. Rogers, C. R., Measuring Personality Adjustment in Children 9 to
13 Years of Age, New York: Bureau of Publications, Teachers College,
Columbia University, Contributions to Education No. 458,
1931.

20. Saint Clair, W. F., and Seegers, J. C., “Certain aspects of the validity
of the F scores on the Bernreuter Personality Inventory,” Journal of
Educational Psychology, 29 : 301-311 (1938).

21. Sheviakov, G. V., and Friedberg, Jean, Evaluation of Personal and
Social Adjustment: A Report of Progress of the Study, Chicago:
Evaluation in the Eight-Year Study, Progressive Education Association,
1939.

22. Speer, G. S., “The use of the Bernreuter Personality Inventory as
an aid in the prediction of behavior,” Journal of Juvenile Research,
20 : 65-69 (1936).


23. Spencer, Douglas, The Fulcra of Conflict, a New Approach to Personality
Measurement, Yonkers: World Book Company, 1939.

24. Stagner, R., “The validity and reliability of the Bernreuter Personality
Inventory,” Journal of Abnormal and Social Psychology,
28 : 413-418 (1934).

25. Stogdill, Emily L., and Thomas, Minnie E., “The Bernreuter Personality
Inventory as a measure of student adjustment,” Journal of
Social Psychology, 9 : 299-315 (1938).

26. Super, D. E., “The Bernreuter Personality Inventory: a review of
research,” Psychological Bulletin, 39 : 94-125 (1942).

27. Sweet, L., Measurement of Personal Attitudes in Younger Boys,
Occasional Studies, No. 9, Association Press, 1929.

28. Symonds, P. M., Diagnosing Personality and Conduct, New York:
D. Appleton-Century Company, Inc., 1931.

29. Traxler, A. E., The Nature and Use of Anecdotal Records, Supplementary
Bulletin D, New York: Educational Records Bureau,
1939.


CHAPTER XVII 


Attitudes and Related Aspects 




In Chapter VI we discussed the importance of attitudes in the
integration both of individuals and of the societies in which they 
live. A general definition of attitudes was offered, the relationships 
of attitudes to other allied concepts were indicated, and the prob- 
lems of the organization and determiners of attitudes were discussed. 
The chapter concluded with a presentation of attitudes considered 
to be significant for educational guidance. 

First we may point out a distinction between two functions of 
attitude measurement. In the first place, the attitudes of pupils may 
be evaluated as educational outcomes, as indications of the degree to 
which pupils have acquired certain attitudes set up as objectives of 
instruction. Such attitudes are most often included in statements 
of objectives of the social studies. But, of course, attitudes may be 
and have been considered as objectives of instruction in all areas of 
the curriculum, including mathematics, natural science, art, and 
language studies. Secondly, attitudes may be evaluated as part of 
the attempt to predict the adjustment of pupils to various school 
curricula and vocations. Students’ attitudes toward school subjects 
and vocations are often termed their educational and vocational 
“interests” or “preferences,” but the difference in terminology should 
not obscure the fundamental psychological equivalence of these 
aspects of pupils to the concept of attitudes. 

This distinction between the two functions of attitude measure- 
ment for pupils arises from the difference in the degree to which 
the attitudes measured are set up as desirable goals for all pupils, 
that is, as instructional objectives. Attitudes toward curricula and 
vocations are not established as goals; pupils may fully achieve
instructional objectives and yet differ widely among themselves in
their attitudes, preferences, or interests in various curricula and 
vocations. It may be considered desirable, for example, for all pupils 
to acquire certain attitudes toward democratic principles, but for 
all pupils to acquire certain attitudes toward the "technical" cur- 
riculum in high school as distinguished from the "academic" or 
the "commercial" curriculum is not set up as a goal of educational 
endeavor. 

In the present chapter we shall consider first the techniques which 
have been devised and made available for measuring attitudes as in- 
structional objectives and then the techniques for measuring attitudes 
as indicators of curricular and vocational adjustment. 


ATTITUDES AS EDUCATIONAL OUTCOMES


Approaches to the Evaluation of Attitudes—Techniques for the 
evaluation of attitudes may be grouped under headings similar to 
those employed in the preceding chapter for adjustment evaluation. 
That is, self-inventories or questionnaires; rating scales; tests of 
conduct, knowledge, or judgment; and anecdotal records may all 
be used if the appropriate modifications are made. Of these tech- 
niques, the rating scales and the anecdotal records are least changed 
in form when applied to attitude evaluation. In place of traits or 
modes of adjustment, the rating scale becomes a series of statements 
of attitudes or points of view on which the pupil is rated by his 
teachers or fellow pupils along some continuum such as agree-disagree.
The anecdotal record similarly can be used when the
scope of the behaviors considered significant is broadened so as to
include those indicative of attitudes toward various psychological
objects as well as those indicative of emotional and social adjustment.
It is in the inventory and questionnaire technique of evaluating
attitudes that the greatest departure from the techniques of
evaluating other aspects of pupils has been made. For this reason
the major part of our attention will be centered on this general 
method. Let us now turn to the techniques which have been used 
for the construction of inventories of attitudes. Since these inven- 
tories are usually called scales or questionnaires, this terminology 
will be used in the discussion. 

In general, the techniques used for the construction of attitude
inventories have been of two kinds: (1) the equal-appearing-intervals
method, and (2) the simple summation of responses indicating
attitude in a given direction.

Attitude Scales.—The first of these methods, the equal-appearing- 
intervals technique, was mentioned in Chapter XI in connection 
with product evaluation. It proceeds by the following steps (29): 
First a large number of statements expressing opinions concerning 
an attitude object are collected. All of them must express an attitude 
toward (a feeling for or against) a single object: the policy, practice,
institution, group, or anything else, the attitude toward which
is to be measured. Mere statements of fact concerning the at- 
titude object are not suitable, nor are opinions which do not carry a 
connotation of favorableness or unfavorableness. The statements 
in this initial collection should range along the entire continuum 
from extreme favorableness through neutrality or indifference to 
extreme unfavorableness. It goes without saying that they should be
unambiguous, that is, subject to only one interpretation. Further- 
more, they should not be “double-barreled” or express more than 
one complete thought, since in this case a person may agree with 
one part of the statement and disagree with the other part. 

After the initial group of statements has been collected and edited, 
they are given to a group of 100 or more judges to be sorted into 
eleven piles or rated on a scale of eleven points which represent 
equal steps from the most favorable extreme through neutrality to 
the most unfavorable extreme of attitude toward the particular ob- 
ject. Only the two extremes and the middle point are defined as 
such; the intermediate piles are left to the individual sorters for 
definition, the only restriction being that they represent equal in- 
tervals along the range of attitude. This freedom is given the judges 
to insure that the resultant sorting will represent equal intervals 
in their opinion and not intervals determined by whoever applies 
descriptive terms to the various piles of statements. 

Then the scale value, or the average (median) of the ratings along 
the favorable-unfavorable continuum, is determined for each state- 
ment. Some of the statements, however, will be sorted more uni- 
formly by the judges than other statements. That is, the amount of 
scatter (quartile deviation) of the judgments will differ for the 
various statements. The ambiguity of a statement in expressing a
given degree of favorableness or unfavorableness toward the attitude
object may be measured by this spread. Statements on which the
judges' ratings scatter widely are thus identified as ambiguous and
eliminated.
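The two statistics named above, the median as scale value and the quartile deviation as index of ambiguity, can be computed as in the following minimal sketch. The judges' ratings are invented, the function names are hypothetical, and the quartiles are taken by the "inclusive" interpolation of Python's statistics module, which is an assumption and may differ slightly from the method of the original studies. The cutoff of 1.5 for ambiguity is likewise illustrative only; the text gives no figure.

```python
from statistics import median, quantiles


def scale_value(ratings):
    """Scale value of a statement: the median of the judges' ratings."""
    return median(ratings)


def quartile_deviation(ratings):
    """Half the distance between the first and third quartiles."""
    q1, _, q3 = quantiles(ratings, n=4, method="inclusive")
    return (q3 - q1) / 2


# Invented ratings by nine judges on the eleven-point scale.
clear = [7, 8, 8, 8, 8, 8, 9, 9, 9]
ambiguous = [2, 3, 4, 5, 6, 7, 8, 9, 10]

MAX_SPREAD = 1.5  # hypothetical cutoff; not taken from the text

for name, ratings in [("clear", clear), ("ambiguous", ambiguous)]:
    spread = quartile_deviation(ratings)
    verdict = "retain" if spread <= MAX_SPREAD else "eliminate"
    print(name, scale_value(ratings), spread, verdict)
```

In practice 100 or more judges would be used, as the text states; nine are shown here only to keep the arithmetic easy to follow.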

After the scale values of all the statements in the preliminary 
collection have been determined and only the unambiguous and 
relevant ones have been selected, the number of statements can be 
reduced so as to include in the final attitude scale only those which 
represent fairly even steps along the entire continuum of favorable- 
ness-unfavorableness. The scale is administered by asking subjects 
to place a check mark after all the statements they endorse as an 
expression of their own sentiment, opinion, or attitude. The subject’s 
score on the scale is the average scale value of the opinions that 
he has endorsed. 
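The scoring rule just stated may be illustrated briefly. In this sketch the statement labels and scale values are invented, the name `attitude_score` is hypothetical, and the arithmetic mean is assumed as the "average"; the median of the endorsed scale values could be substituted if that is the average intended.

```python
from statistics import mean

# Hypothetical scale values for five statements of a finished scale.
scale_values = {"A": 1.5, "B": 3.0, "C": 5.5, "D": 8.0, "E": 10.5}


def attitude_score(endorsed, scale_values):
    """Average scale value of the statements a subject has checked.

    Returns None when no statement is endorsed, since no average exists.
    """
    if not endorsed:
        return None
    return mean(scale_values[statement] for statement in endorsed)


score = attitude_score(["C", "D", "E"], scale_values)
print(score)
```

A subject who checks no statements receives no score under this sketch, rather than a score of zero, which would wrongly place him at one extreme of the continuum.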

Obviously the amount of labor required in constructing a scale 
to measure attitude toward any psychological object is so great that 
it would be impossible to build scales for all possible significant at- 
titude objects. Remmers (22) has attempted to overcome this 
difficulty to some extent by developing generalized or master atti- 
tude scales which can be used to measure attitude toward any one of 
a class of attitude objects, such as school subjects or vocations. The
statements in the general attitude scale are not related specifically 
to any single attitude object, but if the name of the appropriate ob- 
ject is filled in at the head of the scale, they can be interpreted 
meaningfully for any representative of the class of objects for 
which the scale is intended. The scale values of Remmers' scales 
are determined by Thurstone’s equal-appearing-intervals technique 
whereby large numbers of preliminary statements are rated on an 
eleven-point scale from complete unfavorableness through neutrality 
to complete favorableness. 

Both Thurstone and Remmers have directed the construction of a 
large number of attitude scales for various objects. The scales which 
these workers have made available are as follows: 


Thurstone Scales for the Measurement of Attitude Toward 


God (E. J. Chave and L. L. Thurstone) 
War (D. D. Droba) 

The Negro (E. D. Hinckley) 

The Law (D. Katz) 

Capital Punishment (R. C. Peterson) 
The Chinese (R. C. Peterson) 

The Germans (R. C. Peterson) 

War (R. C. Peterson) 

Censorship (A. C. Rosander and L. L. Thurstone) 

The Constitution of the United States (A. C. Rosander and L. L. 
Thurstone) 

Prohibition (H. N. Smith and L. L. Thurstone) 

Patriotism (M. B. Thiele and L. L. Thurstone) 

Communism (L. L. Thurstone) 

Evolution (T. G. Thurstone) 

The Church (L. L. Thurstone and E. J. Chave) 

Immigration (L. L. Thurstone) 

League of Nations (L. L. Thurstone) 

Free Trade (L. L. Thurstone) 

Monroe Doctrine (L. L. Thurstone) 

German War Guilt (L. L. Thurstone) 

The Bible (L. L. Thurstone and E. J. Chave) 

Economic Position of Women (L. L. Thurstone) 

Foreign Missions (L. L. Thurstone) 

Divorce (L. L. Thurstone) 

Freedom of Speech (L. L. Thurstone) 

Social Position of Women (L. L. Thurstone) 

Honesty in Public Office (L. L. Thurstone) 

Preparedness (L. L. Thurstone) 

Public Ownership (L. L. Thurstone) 

Unions (L. L. Thurstone) 

Public Office (L. L. Thurstone) 

Birth Control (C. K. A. Wang and L. L. Thurstone) 

Sunday Observance (C. K. A. Wang and L. L. Thurstone) 

The Treatment of Criminals (C. K. A. Wang and L. L. Thurstone) 

The Movies (L. L. Thurstone and R. C. Peterson) 


Remmers Master Attitude Scales for the Measurement of Attitude Toward 


Any Disciplinary Procedure (V. R. Clouse) 
Any Elementary Teacher (M. Amatora) 

Any Home-making Activity (B. K. Vogel) 
Any Play (M. Dimmit) 

Any Practice (H. W. Bues) 

Any Proposed Social Action (D. M. Thomas) 
Any Racial or National Group (H. H. Grice) 
Any School Subject (E. B. Silance) 

Any Social Institution (I. B. Kelley) 

Any Teacher (L. B. Hoshaw) 

Any Vocation (H. E. Miller) 

The Remmers master scales are illustrated by the excerpt of 
the Kelley-Remmers Scale for Measuring Attitude Toward Any 
Institution shown in Fig. 9. Whereas in the Thurstone scales the 
statements are arranged in random order, in the Remmers scales 
they appear in order of decreasing favorableness. This arrangement 
has been found (22 : 16) to decrease greatly the time and labor 
required for scoring without affecting the accuracy of the 
measurement. 

The validity of the master scales has in many instances (22) been 
determined in terms of their correlations with comparable specific 
scales in the Thurstone series. These correlations have in general 
been sufficiently high to warrant the conclusion that both types of 
scale measure essentially the same attitude. For instance, the Kelley- 
Remmers Scale to Measure Attitude Toward Any Institution when 
used to measure attitudes toward communism yielded scores which 
correlated almost perfectly with those obtained by the Thurstone 
Scale for the Measurement of Attitude Toward Communism. 
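
A correlation of the kind referred to above can be computed as in the following sketch. The two sets of scores are hypothetical, not data from the studies cited, and the product-moment coefficient is assumed to be the index intended.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment correlation between two equally long lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical scores of five pupils on a master scale and a specific scale.
master_scores = [3.1, 4.8, 6.0, 7.4, 9.2]
specific_scores = [2.9, 5.1, 5.8, 7.9, 9.0]

r = pearson_r(master_scores, specific_scores)
print(round(r, 2))
```

A coefficient near 1 would support the conclusion that the two scales order pupils in nearly the same way.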

Attitude Questionnaires.—Before the equal-appearing-intervals 
technique was applied to attitude scales the simple questionnaire 
was most widely used to measure attitudes. Attitude questionnaires 
are collections of statements or questions to which an individual 
responds yes or no. For some of the statements a yes response 
indicates attitude in a certain direction while for others a no response 
indicates this direction. The statements are not scaled as to 
intensity or degree of favorableness-unfavorableness. Rather a 
measure of degree is obtained by adding all the responses, yes or no, 
which indicate attitude in a given direction. The greater the number 
of statements of one type with which a person agrees, and the greater 
the number of statements of the opposite type with which he 
disagrees, the more favorable is his attitude. 
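
The counting rule for questionnaires can be sketched as follows. The item labels, the scoring key, and the responses are hypothetical; the key records, for each statement, which response (yes or no) indicates the attitude in the direction being scored.

```python
def questionnaire_score(key, responses):
    """Count the responses that match the keyed direction, one point each."""
    return sum(1 for item, keyed in key.items() if responses.get(item) == keyed)


# Hypothetical key: the response that counts toward the scored direction.
key = {"q1": "yes", "q2": "no", "q3": "yes", "q4": "no"}

# Hypothetical responses of one pupil.
responses = {"q1": "yes", "q2": "no", "q3": "no", "q4": "no"}

score = questionnaire_score(key, responses)
print(score)
```

Every statement contributes the same single point, which is the equal weighting that distinguishes the questionnaire from the scaled statement.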

Illustrative of such questionnaires is the Harper Social Study (11) 
developed in 1925 to measure the liberalism-conservatism of teachers 
and college students. As in most other attitude questionnaires, the 
responses which indicated a given attitude were determined by con- 
sultation with authorities on the various issues presented. 

[Fig. 9.—Excerpt from Remmers-type attitude scale: "A Scale for Measuring Attitude Toward Any Institution," by Ida B. Kelley, edited by H. H. Remmers, Form A. The legible statements include "Is necessary to the very existence of society," "Encourages social improvement," and "Serves society as a whole well."] 

Wrightstone (32) has constructed a Scale of Beliefs which employs 
the questionnaire procedure. His questionnaire, a series of 
statements concerning racial, international, and national-political 
issues and attitudes toward national achievements and ideals, 
includes an equal number of expressions characteristic either of 
liberals or of conservatives. The statements were validated (1) by 
comparison with a list of beliefs represented in textbooks 
commonly used in American schools, (2) by a statistical study of the 
ability of each statement to discriminate between liberals and 
conservatives, (3) by the rankings given by social scientists to the 
statements in order of liberalism or conservatism, and (4) by 
determination of the degree to which the liberal statements agreed 
and the conservative statements disagreed with expressions of 
editorial opinions in liberal journals such as the Nation and the 
New Republic. The student marks either yes or no for each 
statement and his score for liberalism or conservatism is computed 
directly from the number of liberal or conservative statements with 
which he agrees. 

Scales and Questionnaires Compared.—It is evident that one dis- 
tinction between questionnaires and scales is the refinement which 
the latter possess as to the degree of favorableness represented by 
each statement, and consequently the method of scoring. Since 
scale statements are given values differing in magnitude as well as 
direction, the score is the average magnitude which is endorsed. A 
questionnaire score, on the other hand, is simply a summation of 
the responses in a given direction, each response being equally 
weighted for lack of scaling. A major difference between the ques- 
tionnaire and the scale is the fact that questionnaires require norms 
for interpretation of the scores whereas scales do not because they 
incorporate such norms by virtue of the procedure used in their 
construction. 

The measures provided by questionnaires suffer from certain 
theoretical disadvantages. In the first place, each statement is 
equally weighted, although some of them probably indicate a far 
greater degree of liberalism or conservatism than others. It is 
impossible to say, for example, that a score of 75 indicates as much 
greater liberalism than a score of 50 as a score of 50 indicates a 
greater liberalism than a score of 25, although this is a justifiable 
interpretation from scores on attitude scales constructed by the 
equal-appearing-intervals technique. In the second place, such 
questionnaires include a large variety of statements indicating attitudes 
toward a great many things on the assumption that, since they 
all differentiate between liberals and conservatives, a general factor 
of liberalism or conservatism runs throughout the questionnaire. 
Thus it is difficult to interpret in specific terms what a score on 
such a scale means since the attitude object is so generally and 
vaguely defined. With attitude scales constructed by Thurstone’s 
technique, or with Remmers’ master attitude scales, the attitude is 
strictly and unambiguously delimited in terms of the single attitude 
object or class of objects the attitude toward which is measured. 
Thus, one of the statements in such a scale as Wrightstone’s may 
deal with the Negro, another with the Chinese, another with Ger- 
mans, and still another with Russians; all these attitudes are then 
grouped together as a class of racial attitudes without any evidence 
that such an overall grouping is justified for a given individual. 
By the Thurstone or the Remmers approach, attitude toward each 
of these nationalities and races would be measured separately and 
no assumptions concerning their interrelationship would be made. 
It is probable that a clearer, more understandable picture of a pupil’s 
attitudinal make-up can thus be obtained. 

On the other hand, wherever it is desired to assume such general 
overall attitudes as liberalism-conservatism, the questionnaire ap- 
proach exemplified by the Wrightstone scale can furnish valuable 
results. This is especially true when such general attitudes con- 
stitute an objective of school instruction. 

Validity of Scales and Questionnaires.—Various technical issues in 
attitude scale construction and interpretation have been raised by 
such writers as Likert (16), Stagner (24), Corey (3), Hinckley (12), 
and many others. The present discussion will not examine the details 
of these arguments; for a critical review of the technical literature 
on attitude scales the reader should consult Newcomb (18 : 889-912) 
or Beery and Bare (1 : 1-17). The validity of the attitude measuring 
devices here discussed has been much disputed. 

The arguments for the validity of attitude scales and question- 
naires are based, first, on the logic of their construction, and second, 
on the correlation between their results and overt commitments of 
various kinds, such as membership in certain organizations, votes 
in elections (e.g., the Gallup poll has been validated by national 
elections), or other acceptable criteria. Numerous studies have validated 
attitude scales and questionnaires in terms of such criteria. Other 
studies, failing to find correlations with certain criteria, have enabled 
a closer definition of what is measured by these scales and ques- 
tionnaires. 

It is sometimes maintained that attitude measurements should 
serve as predictors of behavior and that any set of measures which 
does not correlate highly with actual behavior toward the attitude 
object should be considered invalid. The situation here is much the 
same as that discussed in the preceding chapter in connection with 
tests of knowledge and judgment of ethical matters. It will be re- 
called that, in general, there was a very low correlation between 
such tests and the actual performance of individuals in activities 
to which the knowledge or judgment might be considered related. 
Thus pupils might know the rules of honesty and be able to apply 
them in verbally expressed situations without necessarily being 
themselves honest in those situations. In the same way, various 
studies have shown that measures of attitudes may fail to predict 
what pupils will do in actual situations. 

On the other hand, it is maintained that the failure of attitude 
scales to meet this kind of test does not mean that they are without 
value. For if the responses they elicit can be considered adequate 
measures of the verbal form which “feelings for or against some- 
thing” may take, they are in themselves valuable regardless of their 
relationship to “actual behavior.” An attitude is seldom the sole 
determiner of behavior even in those situations to which the atti- 
tude seems most closely related. Other attitudes may work at cross 
purposes with the attitude which has been measured so that the 
resultant effect upon behavior is not what would be expected from 
knowledge of the single attitude. 

Furthermore, the verbal expression of an attitude frequently is 
also the most fundamentally important expression of that attitude 
for the purposes of living. Thus, what indication of one’s attitude 
toward the League of Nations would be more adequate than what 
one says about the League of Nations? Our most practical behavior 
toward such an agency or institution could hardly be other than 
verbal. Our attitudes toward candidates for political office are, of 
course, measured for the practical purposes of government by 
means of the crude two-point, yes-no, attitude scales known as 
election ballots. 

Attitude scales and questionnaires face the same difficulty as 
do the self-inventories used for the evaluation of adjustment. That 
is, individuals may not be willing to be frank and honest about 
their attitudes. This difficulty may be overcome either by estab- 
lishing a high degree of rapport between the individual whose at- 
titude is being measured and the person measuring it or by dis- 
guising the attitude scale so that he will not realize that he has any- 
thing to conceal. The necessity for rapport becomes evident when- 
ever there are socially defined “right” and “wrong” attitudes toward 
a particular attitude object in a particular situation. Pupils living in 
an open-shop coal mining town completely owned and operated 
by anti-union coal mine operators would be wary of expressing 
attitudes explicitly favorable to trade unions unless their responses 
could be made anonymously or they had complete trust that their 
teacher would keep their answers strictly confidential. These pupils 
would have grown up in a section where disapproved attitudes con- 
cerning such an issue could easily cost the family breadwinner his 
job; hence they would probably be extremely cautious in expressing 
their attitudes. 

The second method of eliciting valid responses, disguising the 
attitude scale, depends upon a distinction between “stereotypes” and 
the clusters of opinions and behaviors underlying such “stereotypes.” 
An attitude toward a stereotype such as “fascism” is the individual's 
conscious response to statements containing that word. Stagner (24) 
has constructed an instrument to measure fascist attitudes in which 
the term does not appear but in which such components of fascism 
can be expressed as opposition to labor unions, to various national 
and racial groups, to democratic procedures, and to freedom of ex- 
pression for radical groups. He found that people who were dis- 
tinctly opposed to the stereotype “fascism” might still be favorable 
to many of the policies and practices which fascism directly implies. 
Thus the disguised attitude scale was able to reveal attitudes which 
an attitude scale employing stereotypes would have concealed. 

One criterion of attitudes which has frequently been used for the 
validation of attitude scales is membership in groups and organizations 
of known attitudinal make-up. Thus Thurstone and Chave's 
Scale for the Measurement of Attitude Toward the Church was 
able to differentiate clearly between divinity students and other 
college students, between active and inactive church members. 
Similarly, scales designed to measure attitudes toward the Negro 
have been found to yield very different average scores when filled 
out by northern and by southern university students. Attitude scales 
toward war and prohibition and liberalism and conservatism in 
political affairs and many other psychological objects have been 
validated in this fashion. Stouffer (25) used as a criterion for the 
validity of his prohibition attitude scale a set of thousand-word 
compositions written by students concerning their feelings and ex- 
periences from childhood on as related to prohibition and drinking 
liquor. Ratings of the composition by four judges who read each 
case history and scored it for attitude on a graphic rating scale cor- 
related highly with scores on the attitude scale which Stouffer 
constructed by the Thurstone method. The attitude scale and the 
compositions were completely anonymous, being matched solely by 
means of code numbers. 

Another technique which has been considered to provide a valida- 
tion of attitude scales is that of determining whether the scales 
reflect the changes in attitude to be expected when individuals are 
subjected to experiences, such as lectures or movies, which are de- 
signed to have a strong effect on attitudes in a given direction. 
Thus Thurstone and Peterson (2) administered a scale of attitudes 
toward the Chinese to a group of junior and senior high school 
pupils both before and after they were shown a movie designed to 
shift attitudes toward the Chinese in a favorable direction. The 
pupils’ attitudes toward the Chinese as reflected in their average 
scores on the scale became distinctly more favorable, as might be 
expected from the nature of the experience to which they were sub- 
jected; this may be considered as evidence for the validity of the 
method of measuring attitudes. 
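
The before-and-after comparison described above amounts to a difference between two average scores. The following sketch uses hypothetical scale scores for three pupils, not figures from the study cited; a positive result indicates a shift in the favorable direction.

```python
from statistics import mean


def mean_shift(before, after):
    """Change in the group's average scale score from pre-test to post-test."""
    return mean(after) - mean(before)


# Hypothetical scale scores of the same three pupils before and after a film.
before = [4.0, 5.0, 6.0]
after = [5.0, 6.5, 6.5]

shift = mean_shift(before, after)
print(shift)
```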

Kelley (14) validated the master Scale for Measuring Attitude 
Toward Any Institution in terms of the differences between scores 
for attitude toward Sunday observance obtained by Seventh Day 
Adventists, whose actual conduct and overt commitments led to the 
expectation of low (unfavorable) scores, and by groups of Methodists, 
Baptists, and United Brethren. Fig. 10 shows that the attitude 
scale discriminated almost perfectly between these two groups. The 
correlation with the specific Thurstone Scale for Attitude Toward 
Sunday Observance was .83, which indicates the similarity of results 
obtained with master and specific scales. 

[Fig. 10.—Attitude toward Sunday observance. Legend: other churches; Seventh Day Adventists (N = 105). Vertical axis: percentages.] 

Similarly striking validation of a master scale, the Remmers-Thomas 
Scale for Measuring Attitudes Toward Any Proposed Social 
Action, was obtained by Peters and Peters (19) in their study of the 
effect of pupil self-government on attitudes toward law enforce- 
ment. Two groups of pupils, one of which participated in a thor- 
oughgoing scheme of pupil government of school while the other 
had no self-government, were asked to indicate on the master scale 
their attitudes toward the judge’s decisions in ten actual cases of 
law violation. The cases were selected to exemplify the desires to 
gain wealth, to save the life of a member of the family, to gain or 
keep friends, to improve the family’s living conditions, and violation 
through ignorance of the law—all of these being considered in- 
stances of the principal reasons for children’s disobedience of law. 
It was found that the attitudes toward law enforcement of the pupils 
who were experiencing self-government were significantly more 
favorable than those of pupils without self-government. 

The persistence of attitude changes effected by instructional 
activities has been studied by means of master attitude scales. Hall (10) 
found that the attitudes of pupils toward social insurance, capital 
punishment, labor unions, and Negroes could be changed by means 
of articles read to them by the teacher. The changes were in the 
directions expected and, as is shown in Fig. 11, tended to persist over 
a period of at least six months with little change after an initial 
regression. Similar results were obtained by Williamson and Remmers 
(31) with the attitudes of rural and urban pupils toward 
such conservation issues as allowing the government to tell the 
farmer how to farm, allowing each farmer to farm as he pleases, 
clean farming, taxing all the people to plant new forests, and 
draining swamps. Typical results are shown in Fig. 12. The pupils’ 
attitudes toward clean farming became less favorable as a result 
of the instructional materials designed to have this effect, and the 
effects lasted for at least four months for the urban and eight 
months for the rural pupils. 

[Fig. 11.—The persistence of attitude changes. Panels: Persistence of Attitude, Capital Punishment; Persistence of Attitude, Negroes. Horizontal axes: time of testing, beginning with the pre-test.] 

[Fig. 12.—The persistence of attitude changes. Horizontal axis: pre-test; 1 day (post-test); 1 month (October test); 4 months (January test); 8 months (May test).] 

Attitudes as characteristics of groups are probably equal in interest 
and importance to attitudes as characteristics of individuals. We 
frequently wish to ascertain how pupils differing in grade level, 
sex, rural or urban environment, school attended, educational ex- 


"TETECUTERN 
ity it F l: TT I TI 
+ H if jii ij : i! 

] iiit yeh litt eit Hi i: 
H HE JHB. it: H x 11 H - 1 j 
piia s lin aia H | "n 
d TU ín 111 i 
ulli 1j hl - i ii $ HEHP 
Hades ta pun 


“ 


LE 


VE Eu 
JE iil H 3 


UE = be wold adu 


sie rok ad paing wid, aad chs a wi be Aa owa aii ii de pne 
uni bon 


Atithi ood Robatrd Aiport 
sen a 
rie 
rre 
n 
dere tm 
mer 
nibubsiey of 
magli gee 
em 
m 
Tm 


ai amans iun 


od mid porshen Thee aran compres edhe te then 
the rmm 


islam. the pete te dio em 


IL 


À beat arongiy then em ime shoshi be tree to do sn ber 
si 
mem enit 
EI 
we 
p 


[al 


ei de men PII LII LI 
Dagia Vaw Sendy. Kath ed he tmm m 


IL II 
dmpn W ke b aa 
(heut de v 
deba 
ene 

two 


Mu Toren en PAm ames Come uman p a 


Desde of Bobejo Trae ip and ap h 


406 How to Evaluate 


servatism, and uncertainty. The pupil's agreement with himself is 
measured by giving the two tests on separate days and determining 
a consistency score on the basis of responses to paired items. 
Provision is made for obtaining the four separate scores for each of the 
six classes of social issues so that a somewhat detailed picture of 
the pupil's social attitudes becomes available. The following 
illustrates the form of a pair of opposite statements: 


Test 4.21, Item 13. "The best of medical care should be provided for 
rich and poor alike." 

Test 4.31, Item 103. "The quality of medical services made available to 
individuals should depend upon their ability to pay." 

A Scale of Beliefs for Junior High School (Tests 4.4 and 4.5) has 
also been prepared to enable evaluations of pupils in Grades 7 
through 9. 
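
A consistency score based on paired opposite statements, such as the pair quoted above, might be computed along the following lines. This is only a sketch under stated assumptions: the text does not give the actual scoring rule, so a pair is counted here as consistent when the pupil agrees with one statement and disagrees with its opposite. Apart from Items 13 and 103, the item numbers and all the responses are hypothetical.

```python
def consistency_score(pairs, first_test, second_test):
    """Proportion of paired items answered in opposite directions.

    Assumed rule: a pair is consistent only when one statement is
    marked "agree" and its opposite is marked "disagree".
    """
    opposite = {("agree", "disagree"), ("disagree", "agree")}
    consistent = sum(
        1 for a, b in pairs if (first_test.get(a), second_test.get(b)) in opposite
    )
    return consistent / len(pairs)


# Item 13 is paired with Item 103 as in the text; other pairs are hypothetical.
pairs = [(13, 103), (14, 104), (15, 105), (16, 106)]

# Hypothetical responses of one pupil on the two testing days.
first_test = {13: "agree", 14: "disagree", 15: "agree", 16: "uncertain"}
second_test = {103: "disagree", 104: "disagree", 105: "disagree", 106: "agree"}

score = consistency_score(pairs, first_test, second_test)
print(score)
```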


TABLE 18.—ATTITUDE SCALES AND QUESTIONNAIRES 

Name of Test                                        Grades 
Thurstone-edited Attitude Scales                    9-16, adults 
Remmers-edited Attitude Scales                      7-16, adults 
Wrightstone Scale of Civic Beliefs 
What Would You Do? 
Scale of Beliefs: Tests 4.21 and 4.31 
Scale of Beliefs for Junior High School: 
  Tests 4.4 and 4.5                                 7-9 
Tentative Checklist for Determining Attitudes 
  on Fifty Crucial Social, Economic, and 
  Political Problems                                9-16, adults 
Orientation Test 
Social Attitude Scale: Constitution, Democracy      9-16, adults 

Other devices for attitude evaluation which may prove useful in 
various school situations are Bruner and Linden's Tentative Checklist 
for Determining Attitudes on Fifty Crucial Social, Economic, 
and Political Problems; Case and Limbert’s Around the World, 
designed to test children's attitudes and information about 
international relations; Lentz's C-R Opinionaire, designed to measure 
general conservatism or radicalism with respect to human affairs; 
Lewerenz and Steinmetz's Orientation Test, designed to elicit 
opinions concerning a large number of superstitions and delusions; 
and Rosander's Social Attitude Scales, designed to measure by the 
Thurstone technique attitudes toward the Constitution, democracy, 
and social justice. 


ATTITUDES AS EDUCATIONAL-VOCATIONAL INTERESTS 


Approaches to the Evaluation of Interests.—In order to ascertain 
a pupil's interests, several approaches could be used. In the first 
place, the pupil might be asked to write an account of the activities, 
vocational and avocational, in which he finds the greatest pleasure 
and those from which he derives the least enjoyment. In the second 
place, direct observation of the student, both by his fellow students 
and by his teachers and parents, could be employed. These 
observations could be accumulated in the form of anecdotal records 
described in the preceding chapter. The amount of time the individual 
spends in various activities, the degree of pleasure or displeasure 
he shows in them, the breadth or narrowness of his interests in 
them might all be inferred from these observations. The 
observations could range over such activities as school subjects, recreation, 
jobs, or hobbies. 

In the third place, experimental situations might be set up in 
which the individual is required to participate in various activities 
which could serve as samples of those in which his interests are 
being evaluated. Conclusions concerning interests could then be 
drawn from his behavior in the experimental situation. Thus pupils 
might be required to canvass a neighborhood for advertisements 
for the class yearbook. Their enthusiasm and success in such an 
activity might well serve as the basis for statements concerning 
each one’s interest in such occupations as real estate salesman, life 
insurance salesman, sales manager, advertising man, and, in general, 
commercial work involving dealing with people. 

Typical school activities may similarly serve as experimental try- 
out situations. The interest of pupils in such school subjects as 
mathematics, history, bookkeeping, chemistry, physics, or English 
composition may be interpreted in terms which are meaningful for 
various occupations in the world of work. 

Ability in school subjects is by no means an adequate indicator of 
interest in them or in the occupations which are related to them. 
This is due partly of course to the inadequacy in reliability and 
validity of available measures of ability in the various school sub- 
jects. Symonds (28 :244-245) in summarizing the evidence on the 
relationship between ability and interests concluded that the relation- 
ship between abilities and interests is distinct but not close and 
that for junior high school pupils it is low enough so that one can 
usefully supplement the other for the purposes of guidance. At the 
high school and college levels, the disparity between abilities and 
interests has also been found with sufficient frequency and magni- 
tude to justify the evaluation of the latter as a separate aspect of 
students. For example, Garretson (8:72) reported only a slight 
relationship between the degree of preference for and ability in a 
curriculum as measured by objective tests of curricular aptitude. 
Thus, whereas he concluded that his preference questionnaire was 
valid for the determination of curricular “inclination,” he did not 
consider it useful for the prediction of the degree of success that 
would be achieved by the pupil in the curriculum for which he 
shows the greatest preference. Perhaps if curricular success were 
defined in terms of the pupil’s happiness in and satisfaction with 
his curriculum, a test of interest could be used as a valid predictor 
of such adjustment. 

Apart from all the approaches to interest evaluation mentioned 
above, there are a number of paper and pencil testing devices 
which enable a quantitative determination of a pupil's vocational 
and curricular interests in terms of a test score. These tests are 
for the most part of two kinds: direct approaches to the individual's 
likes and dislikes, and indirect approaches which assume some 
relationship between interest and a measure of information or ability 
in a given field. These two approaches have been distinguished by 
Fryer (7) as subjective and objective interests, respectively. He 
considers subjective interests to denote feelings of pleasantness and 
unpleasantness accompanying interest experiences; they are experi- 
ences of liking, disliking, or indifference. Objective interests, on the 
other hand, are reactions or behaviors which usually but not always 
are in accordance with the individual's feelings, experiences, or 
subjective interests. The only approach to subjective interests is 
through the individual’s expression of them, his verbal indications 
of like or dislike, attraction or avoidance. Objective interests are 
ascertainable not only by verbal reports of like or dislike, interest 
or disinterest, but also through actual behavior which can be ob- 
served by another person. 

Which of these two types of interests should be measured depends 
on various considerations. Ascertaining subjective interests has dis- 
advantages which are derived from the various kinds of errors 
that may accompany an individual’s report of his subjective interest. 
One type of error is the prevarication error whereby an individual 
simply does not tell the truth about his feelings or inner experiences 
concerning a given matter. Thus a boy may be unwilling to be 
frank about his liking for a given occupation because he considers 
its socio-economic status too low. The boy who glories in working 
with automobile engines as a “grease monkey” but whose parents 
have taught him to disrespect working with his hands may commit 
a prevarication error in reporting his subjective interest. An in- 
formation error may result whenever an individual reports subjective 
interest on the basis of false information concerning the given ob- 
ject. The boy who reports an interest in engineering simply because 
he considers it to be outdoors masculine work but who has no ap- 
preciation of its mathematical nature commits this error. The 
generalization error results from reporting one’s subjective interest 
on the basis of information which is inadequate in range and 
extent concerning the given object or activity. The boy who reports 
an aversion for selling on the basis of a childhood failure in selling 
magazines may be unjustifiably generalizing for all selling occupa- 
tions from an insufficient sample of that kind of work. 

Despite these sources of error, measures of subjective interest 
have been found on the whole to have a satisfactory degree of 
validity in correlating with happiness and success in various kinds 
of activity. Most of the paper and pencil approaches to interest 
have been in terms of subjective interests since these have been 
found to enable the measurement of a far greater variety of interest 
patterns than can be approached in terms of objective interests. 
Most objective interest tests have taken the form of information 
tests. The assumption underlying these tests in the measurement 
of interest has been well expressed by Flanagan (6). Defining in- 
terest in an activity as "the extent to which an individual selects 
these activities in preference to others in a free choice situation," 
Flanagan proposes to measure interest in terms of the relative 
strength of an individual's information on contemporary affairs in 
the following six fields: political events, social and economic events, 
science and medicine, literature, fine arts, and amusements. 

Information tests sample the person's knowledge concerning a 
sample of facts in a given field. If he has been interested in that 
field, he will presumably have tended to select reading and sources 
of information concerning it in preference to other fields of interest. 
Consequently he should have more information as a result of his 
wider and more attentive reading and activity in that field. A 
further assumption implicit in information tests as a basis for
interest evaluation is that facts and events in a preferred field will
be better remembered as well as more sought out and attended to 
by individuals interested in that particular field. That interest and 
memory are thus correlated seems a plausible deduction despite the 
lack of evidence explicitly related to this question. In 1898 Wissler 
(7:262) reported an attempt to measure the interest of children 
in various kinds of reading by asking them to “Write the subject 
of the lessons that you remember from the reader you used last 
year." Fryer (7) describes the following attempts to evaluate interest 
by means of information tests: 

1. Agricultural engineering tests (Burtt)

2. Children's play interests (Terman)

3. Children's readings (Wissler)

4. College women's occupational interests (McHale)

5. Girls’ general trade interests (Toops)

6. Men's general trade interests (O'Rourke and Toops)

7. Men's mechanical interests (O'Rourke)

8. Social interests (Ream)


But in surveying the correlations of these tests of interests with 
other estimates of interest and with measures of ability and achieve- 
ment, he concludes (7:290) that “there is no valid evidence that 
something different to ability is measured by information tests. 
What is thought to be an evidence of interest in these measures of 
information may be but a measure of the extent to which these 
tests are measures of the same abilities. The safest conclusion, as 
already stated, is that information tests measure information. But 
the theory persists that in achievement, as evidenced in the acquisi- 
tion of information, there is present an effect of interests as well as 
of abilities." 

Let us now turn to illustrations of subjective interest evaluation 
devices. In the course of the following discussion, the general nature 
of the methods by which the specific devices were constructed and 
the consequent nature of the scores and interpretation which they 
provide will be indicated. 

Available Interest Evaluation Devices.—The Dunlap Academic 
Preference Blank, designed for pupils in Grades 6 to 9, consists of 
90 words or phrases representative of special academic areas to
which the pupil responds by indicating like, dislike, indifference, 
or unfamiliarity. By means of ten different scoring keys the pupil’s 
responses can be interpreted with respect to interest in such sub- 
ject-matter areas as are measured by the New Stanford Achieve- 
ment Test and the Metropolitan Achievement Test. These are 
paragraph meaning, word meaning, history, language usage, geog- 
raphy, literature, arithmetic, and general achievement. In addition, 
scores which reflect mental ability and intellectual alertness can be 
obtained. The Blank has been subjected to extensive and repeated 
experimentation and statistical refinement. The median reliability of 
its scores for pupils in Grades 6 to 9 ranges from .70 to .83 by the
alternate-form method. The validity of the scores was estimated 
by correlations with scores obtained five months later on the
Metropolitan and New Stanford Achievement Tests; the coefficients 
of correlation ranged from .28 to .73 for Grades 6 to 8. 

It is probable that the Dunlap Academic Preference Blank will
prove useful as a basis for guiding junior high school pupils among 
various school subjects and curricula. The interests measured have 
been demonstrated to be predictive of general achievement and
achievement in specific school subjects and indicative of intelligence 
or scholastic aptitude. 

The Garretson-Symonds Interest Questionnaire for High School 
Students is designed to help boys choose between the academic, 
technical, and commercial curricula on entrance to high school. It 
consists of 234 items concerning occupations, activities, school sub- 
jects, job activities, a school paper, the football team, student activi- 
ties, prominent men, things to own, and magazines. The pupil in- 
dicates whether he likes, dislikes, or is indifferent to each item. 
Each response is rated +1, 0, or -1 in accordance with the
differences in response of pupils in each curriculum as compared
with that of pupils in the other two curricula. The resultant scor- 
ing keys were applied to a selected sample of high school students 
in each of the three curricula and were found “to classify correctly 
four out of five pupils on the basis of technical preference and 
three out of four pupils for commercial and academic inclination” 
(8:53). Garretson was able to conclude that, although the
preference scores did not correlate substantially with success in the various
curricula, they would enable the guidance of pupils into curricula 
for which they are fitted by inclination and consequently would 
reduce the number of failures, the loss of time involved in a change 
of curriculum, and withdrawals from school occasioned by cur- 
ricular maladjustment. 
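
The keyed scoring described above can be sketched in a few lines of Python. This is only an illustration: the item names, the three small keys, and the pupil's responses are invented, and the actual Garretson-Symonds keys are not reproduced here.

```python
# Hypothetical scoring keys: each maps (item, response) to +1 or -1;
# any response not listed counts 0.
# Responses are "L" (like), "I" (indifferent), "D" (dislike).
KEYS = {
    "academic": {("Latin", "L"): 1, ("Latin", "D"): -1, ("shop work", "L"): -1},
    "technical": {("shop work", "L"): 1, ("shop work", "D"): -1, ("Latin", "L"): -1},
    "commercial": {("bookkeeping", "L"): 1, ("bookkeeping", "D"): -1},
}


def curriculum_scores(responses):
    """Sum the keyed weights of a pupil's responses for each curriculum."""
    return {
        curriculum: sum(key.get((item, resp), 0) for item, resp in responses.items())
        for curriculum, key in KEYS.items()
    }


# One hypothetical pupil's responses to three items.
pupil = {"Latin": "D", "shop work": "L", "bookkeeping": "I"}
scores = curriculum_scores(pupil)
preferred = max(scores, key=scores.get)  # curriculum with the highest score
```

Each curriculum has its own key, so the same set of responses yields three scores, and the highest of them suggests the curriculum toward which the pupil inclines.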

Cleeton's Vocational Interest Inventory provides ratings of a 
pupil's interest not in terms of single occupations but in terms of 
the following nine types of related occupations: 

1. Biological sciences—physician 

2. Selling 

3. Physical sciences—engineer, technologist, chemist, mathe- 
matician 

4. Social sciences—teacher, minister, social or YMCA worker 

5. Business administration—purchasing agent, business man- 
ager, clerk 

6. Legal and literary occupations—lawyer, journalist 

7. Mechanical occupations—various skilled trades 

8. Finance—accountant, statistician, banker, broker 

9. Creative or public performance occupations—actor, musician,
artist


The above group of occupations is for men; for the women's in- 
ventory a different group is provided. There are 670 items; each 
item in nine sections of 70 items each, grouped on the basis of oc- 
cupational similarity, is to be marked + to indicate Yes or Like, 
or 0 to indicate No or Dislike. A tenth group of 40 items provides a
score for introversion-extroversion. The scoring key was validated on 
the basis of the responses of 7424 persons successfully engaged in 
representative occupations. A high inventory rating which agreed 
with each individual’s actual occupation was found in 76 per cent of 
the cases. Separate letter-grade interpretations for each sex are 
available for Grades 9, 10, 11, 12, college freshmen, and adults. 
Reliability coefficients for various groups centered around .88 by the 
split-half method. The procedure by which the grouping of occupa- 
tions was determined is not explicitly stated by the authors but 
agrees well with occupational classifications derived by the factor 
analyses of Thurstone and Strong. The method of scoring, one 
point being given for each plus mark in a section, is particularly 
advantageous in comparison with such techniques as that used in 
Strong’s Vocational Interest Blank. The women’s occupational 
groups include such occupations as office work, selling, natural 
sciences, social service, creative work, teaching, performance and 
personal service, and mechanical and household occupations. 

The Allport-Vernon Study of Values is a test designed to yield 
measures of the six evaluative attitudes which, according to Spranger, 
are the most revealing aspects of personality. These values, described 
in Chapter VI, are measured by the Allport-Vernon test in terms 
of responses to 30 two-alternative items and 15 four-alternative items. 
The two-alternative items are illustrated by the following: 


1. The main objects of scientific research should be the
discovery of pure truth rather than its practical applications.
(a) Yes; (b) No.                                (a) ____  (b) ____


If the pupil agrees with alternative (a) and disagrees with (b), he 
writes 3 under (a) and 0 under (b); if he has a slight preference for
(a) over (b), he writes 2 under (a) and 1 under (b). Agreement 
with (b) or slight preference for (b) is similarly indicated. Illustra- 
tive of the four-alternative items is the following: 


1. Do you think that a good government should aim chiefly at— 


a. more aid for the poor, sick and old 

b. the development of manufacturing and trade 

c. introducing more ethical principles into its policies and 
diplomacy 

d. establishing a position of prestige and respect among 
nations. 


The pupil writes 1, 2, 3, or 4 before each alternative to indicate the
order of his preference for them. 

Extensive research with the Study of Values reviewed by Duffy 
(5) has in the main corroborated the findings and expectations of 
its authors with respect to norms, retest constancy, and differences 
between the sexes in evaluative attitudes. The scores have been shown
to be significantly related to academic achievement and to various 
educational and occupational groupings. The six scores yielded by 
the test may thus prove valuable for acquiring self-understanding 
and providing a basis for educational and vocational guidance. 

Several shortcomings of the test have, however, been revealed by 
the numerous evaluations which have been made of its reliability 
and validity. In the first place, the scores for social value have been
found to be far less reliable (around .50) than the other scores,
whose reliability ranges around .75. This defect is probably the 
result in large part of an ambiguous conception of social value as 
reflected in the test, some of the items defining it in terms of 
“sociability” while others consider it in terms of “social conscious- 
ness” or concern for the masses of humanity. Other studies have 
shown, by means of correlation coefficients and factor analysis, 
that the political and economic values may be considered nearly 
synonymous and that the aesthetic value is nearly the opposite of 
the same aspect of people. Thus Lurie (17) combined these three 
values into one which he called “Philistinism” on the basis of his 
factor analysis of an original battery of tests of the six values. 

Some writers have considered the scoring system of the Study of 
Values a further disadvantage. This is because scores for each value 
are obtainable not in absolute terms but merely in terms of the 
relative strengths of the values within each person. Thus a person 
who has a high real value for all six categories may obtain scores 
for some of the six that are lower than those made by another
person who has lower real values for those categories. A high score 
in one value is obtainable only at the expense of scores in other 
values; every individual is given a total of 180 points which he dis- 
tributes among the six values. This feature may, however, be de- 
sirable if the object is to study not “how much” evaluative attitude 
various people have but rather how they choose to distribute their 
preferences when given equal quantities of preference to express. 
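
The fixed-total property can be illustrated for the two-alternative items alone, where each item distributes exactly 3 points between two values. The Python sketch below rests on assumptions: the pairing of items with values and both sets of responses are invented, and the conversion of ranks on the four-alternative items, which is not described here, is left out.

```python
VALUES = ["theoretical", "economic", "aesthetic", "social", "political", "religious"]


def score_two_alternative(items):
    """Score two-alternative items.

    items: list of (value_a, value_b, points_a) with points_a in {0, 1, 2, 3}.
    Each item gives points_a to value_a and the remaining 3 - points_a to value_b.
    """
    totals = {v: 0 for v in VALUES}
    for value_a, value_b, points_a in items:
        if points_a not in (0, 1, 2, 3):
            raise ValueError("points must be 0, 1, 2, or 3")
        totals[value_a] += points_a
        totals[value_b] += 3 - points_a
    return totals


# Two hypothetical pupils answering the same three (invented) items.
strong_preferences = [
    ("theoretical", "economic", 3),
    ("aesthetic", "social", 0),
    ("political", "religious", 3),
]
mild_preferences = [
    ("theoretical", "economic", 2),
    ("aesthetic", "social", 1),
    ("political", "religious", 2),
]

a = score_two_alternative(strong_preferences)
b = score_two_alternative(mild_preferences)
# Both pupils end with the same grand total (3 points per item);
# only the distribution among the six values differs.
```

Because every pupil's points sum to the same total, a score says how a value stands relative to that pupil's other values, not how strongly it is held in any absolute sense.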

To circumvent the first two of these shortcomings and provide an 
alternative to the third, Glaser and Maller (9) have constructed an 
Interest-Values Inventory, which is more applicable to pupils at the
high school level than the Study of Values. Four types of interest 
are measured: theoretic, aesthetic, social, and economic. The social 
interest is conceived as interest in the social welfare of groups, or 
social consciousness. A relative ranking of values is used in two parts 
of the test, and “absolute” scores are obtainable in the third part. 
Parts Four and Five elicit self-ratings and personal data not strictly 
related to the evaluative attitudes. Validation of the test was ob- 
tained on the basis of differentiations between four groups: persons 
interested in mathematics and science for the theoretic value, art 
and music for the aesthetic value, social work and nursing for the 
social value, and business and advertising for the economic value. 

The Kuder Preference Record is distinctive in that it uses the 
paired-comparison technique requiring the pupil to choose which 
of three described activities he would ordinarily prefer most and 
which least. Typical choices are: go to a movie or attend a symphony 
concert; go to an amusement park or play a musical instrument at 
home; take a photograph of a champion swimmer or take a
photograph of a table you would like to make. From pupils’ choices
between hundreds of such items, scores can be obtained to indicate
interests in (1) scientific activities, (2) activities involving
computation, (3) musical activities, (4) artistic activities, (5) literary
activities, (6) social service activities, (7) persuasive activities, (8) clerical
activities, and (9) mechanical activities. The procedure for scoring 
the responses is extremely simple, and can be done by the student 
himself simply by counting the number of punctured circles on the 
reverse side of specially prepared answer sheets. Although the nine 
scores have not yet been related to specific occupations, suggestions 
are given in the test manual for interpretations of this kind, and the
procedure for research in this direction has already been determined. 
The reliability of the nine scales, determined by the
Kuder-Richardson method (see Chapter X), ranges from .84 to .90. Validity is
indicated by the average profiles* of groups of students who have
chosen various occupations. In each of these profiles the high and
low scores indicate to a striking degree that different occupational 
groups are differentiated by the Kuder Preference Record. 

Evidence is provided in the manual that the scores do not cor- 
relate substantially with those on Thurstone's Primary Mental Abili- 
ties Test. However, the scientific, computational, and literary scales 
seem to correlate positively with average college grades, while the 
persuasive, artistic, and social service scores tend to correlate nega- 
tively. The usability of the test is increased by its ingenious presenta- 
tion of the test items on pages of unequal width attached to a 
heavy cardboard cover by small metal rings; when the pages are 
turned, the appropriate column of the detachable answer sheet is 
exposed close to the margin of the page containing the items. As 
successive pages are turned, their decreasing width brings the ap- 
propriate column of answer spaces adjacent to the question page. 
Since the answer sheets are detachable, each test booklet can be 
used more than once. The directions for self-scoring are clear and 
appeal to student interest. 

Probably the most widely used device for the evaluation of in- 
terests is Strong's Vocational Interest Blank, which is available in 
forms for men and for women. The blank consists of approximately 
400 items dealing with (1) likes and dislikes for occupations, 
(2) likes and dislikes for school subjects, (3) likes and dislikes for 
amusements, (4) likes and dislikes for activities, (5) likes and 
dislikes for peculiarities of people, (6) order of preference of ac- 
tivities, (7) order of importance of factors affecting one’s work, 
(8) order of preference of men one would most and least like to 
have been, (9) positions you would most and least prefer to hold 
in a club or society, (10) comparison of interests between two items, 
(11) self-rating of present abilities and characteristics. The likes and 
dislikes are indicated by encircling one of the three letters, L I D, 
to indicate like, indifference, and dislike. The orders of preference
are indicated by checking in one of three columns. Although the
test is administered without a time limit, about 40 minutes is
required by the average person. The blank is scored by means of 34
separate scoring keys, each of which provides a score indicative of
interest in a specific occupation. These 34 occupations are grouped
into ten classifications in terms of groups of occupations determined
by factor analysis.

* Profiles are bar diagrams or charts which graphically represent the
strengths of various scores in particular individuals or groups.

Table 19. Interest Evaluation Devices

                                  Grades            No. of   Time Required    Publisher No.
Name of Test                      Designed for      Forms    to Give (min.)   (see list on
                                                                              pp. 561-562)
------------------------------------------------------------------------------------------
Dunlap Academic Preference        6-9               2        15               38
  Blank
Garretson-Symonds Interest        8-10              1        30               6
  Questionnaire for High
  School Students (Boys)
Cleeton's Vocational Interest     9-16, adults      1        45-55            24
  Inventory                       (separate
                                  inventories for
                                  men and women)
Allport-Vernon Study of Values    12-16, adults     1        30               19
Glaser-Maller Interest-Values     9-16, adults      1        30               6
  Inventory
Kuder Preference Record           9-16              1        40-60            30
Strong's Vocational Interest      11-16, adults     1        40               34
  Blank                           (separate blanks
                                  for men and
                                  women)

These scoring keys were determined by giving the test to large 
samples of men who had been successfully engaged in their occupa- 
tion for at least three years prior to taking the test. Each of the 
400 items is given a weight for each occupation which is derived 
from the difference between the proportion of men in the given oc- 
cupation who marked the item in a certain way and the proportion 
of “men in general” who marked the item in that way. At least 
250 men less than 60 years of age were included in each of the
occupational criterion groups. The greater the difference between the
proportion of men in the occupational group who like, dislike, or 
are indifferent to each item and the proportion of “men in general,” 
the more the item differentiates between the two groups and the 
larger the weighting given to that response as an indicator of 
interest in that occupation. The weights range from plus 4 through 
0 to minus 4. (In an earlier form of the test they ranged from minus
11 through 0 to plus 11; the reduction in range has been found to
facilitate scoring without any loss of differentiation between occupa- 
tions.) 
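
The idea behind these weights can be sketched in Python. The sketch is hedged in several ways: the text says only that each weight is derived from the difference between two proportions, so the linear scaling used here (the `scale` and `limit` parameters) is an assumption and not Strong's actual formula, and the item and its proportions are invented.

```python
def item_weight(p_occupation, p_general, scale=10, limit=4):
    """Weight for one response to one item.

    p_occupation: proportion of the occupational group giving the response.
    p_general:    proportion of "men in general" giving the response.
    scale, limit: illustrative assumptions, not Strong's published values.
    """
    raw = round((p_occupation - p_general) * scale)
    return max(-limit, min(limit, raw))  # keep the weight within -4..+4


def occupation_score(responses, weights):
    """Sum the weights attached to the responses a person actually gave."""
    return sum(weights.get((item, resp), 0) for item, resp in responses.items())


# Invented proportions for one item: (occupational group, men in general).
proportions = {
    ("repairing a clock", "L"): (0.70, 0.40),
    ("repairing a clock", "I"): (0.20, 0.30),
    ("repairing a clock", "D"): (0.10, 0.30),
}
engineer_key = {k: item_weight(*p) for k, p in proportions.items()}
score = occupation_score({"repairing a clock": "L"}, engineer_key)
```

A response that the occupational group gives much more often than men in general earns a large positive weight, one given much less often earns a negative weight, and one given about equally often earns a weight near zero.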

In scoring the blanks, the scoring process must be repeated 
separately for each occupation for which a score is desired. Thus if 
measures of interests in 30 occupations are desired, the blank must be 
scored 30 times. Obviously, scoring is a laborious procedure and 
constitutes perhaps the major drawback of the blank. In his manual, 
Strong discusses three methods of scoring: hand-scoring, scoring 
with the Hollerith machine, and with the International Test Scoring 
Machine. It has been estimated that a complete hand-scoring for 
34 occupations cannot be done even by an experienced scorer in less 
than four and a half hours. Since instructions for scoring by all three 
methods are given in the test manual, they will not be discussed 
here. Machine scoring is usually available to teachers only if they 
send the test blank to central offices at various universities. Several 
attempts (19, 9, 16) have been made to simplify the scoring. 

Contrary to the finding of Strong, and of Rock and Wesman, 
Peterson and Dunlap have found (20) that by using a range of item 
weights from plus 1 through 0 to minus 1 instead of plus 4 through
0 to minus 4, scores could be obtained which correlated on the
average about .96 with those obtained by using the more variable
weights. When the scores from these reduced weights were used to
predict the scores which would be obtained by using Strong’s more 
variable weights, the predicted scores agreed almost perfectly with 
those obtained by Strong’s scoring method. Serious changes in the 
scores occurred only about 3 per cent of the time and could be easily 
checked by rescoring whenever necessary. It thus appears that the 
labor of scoring the Strong Vocational Interest Blank may be con- 
siderably reduced without any loss of validity. 
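
The comparison of unit weights with the more variable weights can be sketched as follows. The Python below is a toy illustration, not a reproduction of the Peterson and Dunlap study: the six-response key and the five response patterns are invented, and reducing each weight to its sign is an assumed way of arriving at weights of plus 1, 0, and minus 1.

```python
from math import sqrt


def unit_weight(w):
    """Collapse a weight in -4..+4 to -1, 0, or +1 by keeping only its sign."""
    return (w > 0) - (w < 0)


def score(responses, weights):
    """responses: 0/1 flags, one per keyed response; weights: matching list."""
    return sum(w for r, w in zip(responses, weights) if r)


def pearson(xs, ys):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


full_key = [4, 3, 1, 0, -2, -4]                 # invented variable weights
unit_key = [unit_weight(w) for w in full_key]   # the same key, reduced to signs

# Invented response patterns for five persons (1 = response given).
people = [
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 1, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
]
full_scores = [score(p, full_key) for p in people]
unit_scores = [score(p, unit_key) for p in people]
r = pearson(full_scores, unit_scores)
```

With unit weights, scoring reduces to counting favorable and unfavorable responses, which is where the saving in labor comes from; the correlation `r` shows how closely the simplified scores follow the original ones in this invented data.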


After the raw scores have been obtained for each occupation they 
are converted into the letter grades A, B+, B, B-, C+, or C. An
A rating in an occupation means that the individual has to a high
degree the interests of persons successfully engaged in that
occupation. B+, B, and B- ratings mean the same thing with a lesser
degree of certainty. C ratings mean that the individual definitely 
does not have these interests. 

The reliability of the scores for the various occupations is on the 
average about .88 for college seniors by the split-half technique. 
When 285 college seniors were retested after a period of five years, 
the retest correlation on 21 occupational scales averaged .75. The
validity of the blank in terms of the degree to which it differentiates 
men successfully engaged in an occupation from "men in general," 
that is, men successfully engaged in other occupations, has also been 
found to be quite high. For example, only 15 per cent of 933 non- 
engineers rate A in engineering interests, in contrast with 75 per cent 
of engineers who rate A. Similarly the blank differentiates well be- 
tween men who are highly successful in a given occupation and 
other men in the same occupation who are not so successful in terms 
of annual income from it. Thus, 67 per cent of insurance agents with 
ratings of A were found to be successful on this basis in contrast to 
but 6 per cent of those with ratings of C. 

The retest reliability of the scores of high school students has, in 
general, been found to be somewhat lower than that obtained for 
college students. This decrease is not, however, so great as to make 
the scale entirely without value for the evaluation of high school 
students’ interests. Canning, Taylor, and Carter (2) were able to 
conclude that the blank indicates certain facts about the interests 
of high school boys with considerable reliability. “Thus, if the high 
school boy receives a ‘C’ rating on the first test, there is an 83 per 
cent chance that he will receive the same rating and only a 1 per 
cent chance that he will receive an ‘A’ rating two years later. And 
if he receives an ‘A’ rating on the first test, there is an 88 per cent 
chance that he will receive a rating of ‘B’ or higher two years later.” 

Various supplementary scoring keys have been derived for use 
with the Strong Vocational Interest Blank. Among these is the 
Young-Estabrooks Studiousness Scale (33), with which it is possible
to derive a score for measuring non-intellectual factors associated 
with scholastic success. Another scale is designed for measuring 
maturity of interests, or the extent to which an individual has 
arrived at a stable, long-lasting set of interests. An occupational level 
key is also available for determining whether a student can probably 
find a satisfactory occupation among those which make relatively 
few demands on the worker or whether he prefers for his adjustment 
one of the more professionalized and exacting occupations at the 
upper socio-economic level. 

Tussing (30) succeeded in deriving sets of scoring weights for 
interpreting patterns of likes and dislikes on the Strong blank in 
terms of such traits as self-confidence and sociability as measured by 
the Bernreuter Personality Inventory, theoretical and economic 
evaluative attitudes as measured by the Allport-Vernon Study of 
Values, social adjustment as measured by the Bell Adjustment In- 
ventory, and scholastic aptitude as measured by the American 
Council on Education Psychological Examination for College
Freshmen. The split-half Spearman-Brown reliability coefficients of these
scoring keys ranged from .70 to .87, and the validity coefficients of
the scores, using the various tests as criteria, ranged from .45 to .56.
It is thus apparent that responses to the Strong blank may be useful
for purposes other than vocational interest measurement. The
indirect, disguised nature of the blank when used for these other
purposes increases its value for obtaining frank and meaningful
indicators of these other aspects of pupils. The general applicability
of Tussing’s scoring weights must, of course, be considered in terms 
of their being based on the responses of male adults whose average 
age was 27 years and whose intelligence and educational background 
were perhaps higher than average. 
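
The split-half Spearman-Brown coefficients mentioned above are obtained by correlating scores on two halves of a test and then stepping the correlation up to the full length. A minimal Python sketch of the standard step-up formula follows; the half-test correlation in the example is invented and is not taken from Tussing's data.

```python
def spearman_brown(r_half, factor=2):
    """Estimate the reliability of a test lengthened `factor` times.

    r_half: correlation between the two halves (or between shorter forms).
    With factor=2 this is the usual split-half correction, 2r / (1 + r).
    """
    return factor * r_half / (1 + (factor - 1) * r_half)


# A hypothetical correlation of .77 between the two halves of a scoring key.
full_length_estimate = spearman_brown(0.77)
```

The stepped-up estimate is always higher than the half-test correlation when that correlation is positive and below 1, since the full test is twice as long as either half.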

Satisfactory interpretation of the blank and the counseling of 
students on the basis of information obtained with it are possible 
only for those who have a good understanding of its technique and 
of the accumulated experience of people who have made wide use 
of it in counseling. On the basis of his extensive experience, Darley 
(4) has prepared a manual in which he stresses the importance of 
“pattern analysis” rather than specific occupational keys and dis- 
cusses such problems as the interpretation to be given to scores in 
which no significant interest pattern, or no A ratings, appear. 


That the interpretation of interest ratings obtained with the 
Strong blank should not be straightforward in the narrow sense is 
indicated by correlation coefficients obtained by Johnson (13) be- 
tween various scores and first-year grade-point averages for fresh- 
men engineering students. The score for engineering interests 
correlated .23 with grades; but equally high coefficients were ob- 
tained for the certified public accountant, mathematician, and 
psychologist interest scores; the correlation coefficient of the score 
for chemist was .33, the highest coefficient obtained. Thus in counsel- 
ing students concerning their chances for success in the college 
engineering curriculum, as much help is available from several other 
scoring keys as from the score for engineering interests. 

The Strong Vocational Interest Blank is usually valuable as a 
stimulator of thought concerning vocations, of self-scrutiny, and of 
attention to interests as a factor in vocational choice. Even when 
there is much reason to suspect the meaningfulness for a given in- 
dividual of the scores obtained, the educational advantages of using 
a vocational interest blank in guidance are still considerable. 


SUMMARY 


Attitudes may be evaluated either as educational outcomes or as 
aspects of pupils relevant to educational and vocational guidance. 
Attitude scales and questionnaires have been most used for evaluat- 
ing attitudes. Attitude scales are usually constructed in accordance 
with the equal-appearing-intervals technique. They may be in the 
form of master scales that enable the measurement of attitude toward 
any one or more of a class of objects. Attitude questionnaires, in 
which responses are simply summated, have certain advantages and 
disadvantages in comparison with scales. The purposes for which 
attitude scales and questionnaires are valid have been outlined by 
many research studies. Attitudes as educational-vocational interests 
may be evaluated in several ways, of which the most frequently used 
are paper and pencil inventories of subjective interests. Some
available interest evaluation devices are described.


QUESTIONS 


1. List some attitudes that should be educational outcomes for all 
pupils, and other attitudes in which pupils may differ among
themselves without coming into conflict with the aims of the educative
process.

2. Could a pupil have a favorable attitude toward a school subject in 
terms of its worth-whileness to society or its social significance and 
yet have an unfavorable attitude or low interest in it in terms of his 
desire to study it? How would you discover such a discrepancy? 

3. A number of investigators have found that social studies are dis- 
liked by pupils. Should this attitude be changed or should it be re- 
garded as a matter of taste in which pupils should be permitted to 
follow their own inclinations? How would you undertake to reveal 
such an attitude, its causes, and the means by which it could be 
changed? 

4. Sketch the differences in attitudes which might lead one boy to
become a scientist and another to become a professional bridge
player.

5. Many boys consider language studies to be “sissy” and effeminate.
This attitude frequently has undesirable effects on their achievement
in this field. How would you go about revealing the existence of
such attitudes and their causes? What could be done to change them?

6. Attitudes toward vocations, or vocational interests, have been shown
to be markedly affected by information concerning vocations ac- 
quired when pupils construct workbooks dealing with them. Design 
an experiment to show this effect and discuss its implications for 
vocational guidance. 

7. How can racial attitudes, attitudes toward social and political ideals, 
and “class-consciousness” affect a pupil’s fitness for certain occupa- 
tions? What are the implications of this for the distinction between 
attitudes as educational outcomes and as factors to be considered in 
educational and vocational guidance? 




REFERENCES 


1. Beery, J. R., and Bare, T. H., “The measurement of attitudes,” in 
Briggs, T. H., and others, The Emotionalized Attitudes, New York: 
Bureau of Publications, Teachers College, Columbia University, 
1940. 

2. Canning, L., Taylor, Kathryn V. E, and Carter, Harold D., 
"Permanence of vocational interests of high school boys," Journal 
of Educational Psychology, 32 : 481-494 (1941). 

3. Corey, S. M., "Professed attitudes and actual behavior," Journal of 
Educational Psychology, 38 : 271-280 (1937). 


I2. 


13. 


14. 


Attitudes and Related Aspects 423 


. Darley, J., Clinical Aspects and Interpretation of the Strong Voca- 


tional Interest Blank, New York: The Psychological Corporation, 
1941 


5. Duffy, Elizabeth, “A critical review of investigations employing the
Allport-Vernon ‘Study of Values’ and other tests of evaluative
attitude,” Psychological Bulletin, 37 : 597-612 (1940).

6. Flanagan, J. C., “Measuring interests,” Advisory Service Bulletin,
No. 4, New York: Co-operative Test Service, May, 1940.

7. Fryer, D., The Measurement of Interests, New York: Henry Holt &
Company, Inc., 1931.

8. Garretson, O. K., Relationships Between Expressed Preferences and
Curricular Abilities of Ninth-Grade Boys, New York: Bureau of
Publications, Teachers College, Columbia University, Contributions
to Education, No. 396, 1930.

9. Glaser, E. M., and Maller, J. B., “The measurement of interest
values,” Character and Personality, 9 : 67-81 (1940).

10. Hall, W., “The effect of defined social stimulus material upon the
stability of attitudes toward labor unions, capital punishment,
social insurance and Negroes,” Studies in Attitudes III, Studies in
Higher Education XXXIV, Bulletin of Purdue University, pp. 7-19,
1938.

11. Harper, M. H., Social Beliefs and Attitudes of American Educators,
New York: Bureau of Publications, Teachers College, Columbia
University, Contributions to Education, No. 294, 1927.

12. Hinckley, E. D., “The influence of individual opinion on construction
of an attitude scale,” Journal of Social Psychology, 3 : 283-296
(1932).

13. Johnson, A. P., “The prediction of scholastic achievement for freshman
engineering students at Purdue University,” Studies in Higher
Education XLIV, Purdue University, 1942.

14. Kelley, Ida B., “The construction and validation of a scale to
measure attitude toward any institution,” Studies in Attitudes,
Studies in Higher Education XXVI, Bulletin of Purdue University,
35 : 18-36 (1934).

15. Kopas, J. S., “The point-tally: A modified method of scoring the
Strong Vocational Interest Blank,” Journal of Applied Psychology,
22 : 420-436 (1938).

16. Likert, R., “A technique for the measurement of attitudes,” Archives
of Psychology, No. 140, 1932.

17. Lurie, W. A., “A study of Spranger's value-types by the method of
factor analysis,” Journal of Social Psychology, 8 : 17-37 (1937).


18. Newcomb, T. M., “Social attitudes and their measurement,” in
Murphy, G., Murphy, Lois B., and Newcomb, T. M., Experimental
Social Psychology, Revised Edition, New York: Harper & Brothers,
1937.

19. voce F., and M. Rosanna, “Children’s attitude toward law as
influenced by pupil self-government,” Studies in Attitudes, Series
II, Studies in Higher Education XXXI, Bulletin of Purdue University,
pp. 15-26, 1936.

20. Peterson, Bertha M., and Dunlap, J. W., “A simplified method for
scoring the Strong Vocational Interest Blank,” Journal of Consulting
Psychology, 5 : 269-274 (1941).

21. Peterson, Ruth C., and Thurstone, L. L., Motion Pictures and the
Social Attitudes of Children, New York: The Macmillan Company,
1933.

22. Remmers, H. H., “Generalized attitude scales—studies in
social-psychological measurements,” Studies in Attitudes—A Contribution
to Social-Psychological Research Methods, Studies in Higher
Education XXVI, Bulletin of Purdue University, 35 : 7-17 (1934).

23. Rock, R. T., and Wesman, A., “The comparative efficiency of
various methods of weighting interest test items,” Psychological
Bulletin, 36 : 569 (1939).

24. Stagner, R., “Fascist attitudes: Their determining conditions,”
Journal of Social Psychology, 7 : 438-454 (1936).

25. Stouffer, S. A., “Experimental comparison of a statistical and a
case history technique of attitude research,” Publications of the
American Sociological Society, 25 : 154-156 (1931).

26. Strong, E. K., Jr., Manual for Vocational Interest Blank for Men,
Stanford University: Stanford University Press, 1938.

27. Strong, E. K., Jr., “Procedure for scoring an interest test,”
Psychological Clinic, 19 : 63-72 (1930).

28. Symonds, P. M., Diagnosing Personality and Conduct, New York:
D. Appleton-Century Company, Inc., 1931.

29. Thurstone, L. L., and Chave, E. J., The Measurement of Attitudes,
Chicago: University of Chicago Press, 1929.

30. Tussing, Lyle, An investigation of the possibilities of measuring
personality traits with the Strong Vocational Interest Blank, Doctor’s
thesis on file in the Purdue University Library, Lafayette, Indiana,
1941.

31. Williamson, A. C., and Remmers, H. H., “Persistence of attitudes
concerning conservation issues,” Journal of Experimental Education,
8 : 354-361 (1940).


32. Wrightstone, J. W., Wrightstone Scale of Civic Beliefs, Yonkers:
World Book Company, 1938.

33. Young, C. W., and Estabrooks, G. H., “Report on the Young-Estabrooks
Studiousness Scale for use with the Strong Vocational Interest
Blank for Men,” Journal of Educational Psychology, 28 : 176-187
(1937).


CHAPTER XVIII 


Environment and Background 


THE READER WILL RECALL THAT IN CHAPTER VII THE SOCIO-ECONOMIC 
environment and background of pupils was divided into the home, 
the community, and the school. The home, in turn, was divided 
into (1) parent-to-parent relationships, (2) parent-to-child relation- 
ships, (3) child-to-child relationships, and (4) socio-economic status. 
The present chapter will attempt to provide techniques for the 
evaluation of each of these aspects of the pupil’s environment and
background.


PARENT-TO-PARENT RELATIONSHIPS 


The parent-to-parent relationships in the pupil’s home environ- 
ment have an importance in his adjustment and guidance that is 
easily appreciated. The evaluation of these relationships is, however, 
a difficult matter for the classroom teacher in most situations. This 
difficulty arises from the traditional aura of privacy which surrounds 
relationships between parents. How they feel toward each other, the 
sources of the major satisfactions and discontents in their living with 
each other, are usually considered to be strictly a “family matter” 
and none of the teacher's “business.” A pupil may be maladjusted 
in school and obviously suffering from insufficient parental stability 
and emotional security, but it is none the less hazardous for a teacher 
to take direct steps toward investigating interparental relationships. 
Any such efforts may easily be misunderstood as “prying” and lead 
to serious conflict between teacher and parents and even between the 
school and the community as a whole. 

Situations may arise in which it becomes almost imperative that 
some knowledge of interparental relationships be gained to determine
whether or not certain aspects of a pupil’s adjustment can
be traced to some phase of the relationship between his parents. 
Frequently this knowledge may be gained through the common dis- 
cussions of one’s neighbors, friends, and associates which are known 
as “gossip.” The teacher who lives in a community and is associated 
with the pupils’ parents in matters of community concern will 
frequently be able to observe and evaluate these relationships 
through all the informal channels by which people ordinarily come 
to know one another. A friendly and receptive teacher who com- 
bines an interest in parents with a sense of professional responsibility 
can sometimes gain the parents’ confidence to such an extent that 
informal interviews with them will yield relevant information. 
Teachers who are friendly with parents, who have their confidence 
and respect, will obviously have more success in such evaluations. 

When it is desirable to make a direct attempt to acquire insight 
into interparental relationships, a higher degree of parent-teacher 
rapport than usually exists must first be established. The procedure 
for establishing this will vary from parent to parent, but its general 
nature has been indicated by Baruch (1). In studying the “coexistence”
of reported tensions in interparental relationships with behavior
adjustment in young children, she employed interviews
similar to those used by psychiatrists. As a result of frequent con- 
tinuous contacts, parents felt at ease with the interviewer and were 
willing to give intimate details concerning their lives. They ob- 
served her close relationship with their children and her interest 
in them. In the interviews parents were encouraged to express not 
only intellectualized facts but also their feelings and emotions. An 
attempt was made to encourage them to gain insight into their own 
emotions on the ground that these greatly affected their children 
and that no emotion could be a justifiable cause of guilt feelings. 

The participation of parents in such interviews is best gained by 
explaining their purpose and their thoroughly confidential nature.
The interviewer must make it clear that no confidences will ever be 
revealed to anyone. Furthermore, interviews should not adhere 
rigidly to any fixed schedule or outline. Rather, within wide limits of 
relevancy, parents should be encouraged to talk along any lines 
they choose in the freest and most natural manner possible. 

In encouraging interviewees to reveal material that might be
highly embarrassing or a source of feelings of shame and unworthiness,
Baruch used the technique of reassurance on the basis of the
commonness of similar facts in the lives of most people. The wide- 
spread occurrence of masturbation at various ages, of antagonism 
to parents and siblings, of childhood stealing, or of adolescent sex 
experimentation may all need to be pointed out to interviewees to 
keep them at ease and to maintain and improve rapport during the 
interview.

The kinds of tension which Baruch found significantly related 
to child maladjustment were mentioned in Chapter VII. Among 
them, the ones centering around sex relationships and ascendance- 
submission seemed to be pivotal. That is, these two kinds of tension 
more often produced other kinds of tension or projected themselves 
into the other areas. When sex and ascendance tensions were 
eliminated from the picture, the relationships between the other 
kinds of tension and child adjustment were much reduced. The 
types of child maladjustment to which various kinds of interparental 
tension were related were negativism, overdependence on adults, 
instability, antagonism or cruelty to other children, sleep difficulties, 
and temper. 

Recently an attempt was made to evaluate interparental relation- 
ships by means of a questionnaire which can be filled out by high 
school pupils. Myers (10) prepared a set of questions one of whose
six sections dealt with relations between parents. The other sections
dealt with home membership, supervision, discipline, parent-child
relations, and relations between children. The reliability of the complete
questionnaire was .913 for the group of high school students
used in Myers’ study. When the questionnaire was filled out by 
guidance workers in the school and social case workers in the com- 
munity in the light of their knowledge of the pupil’s family back- 
ground, there was complete agreement between the pupils’ answers 
and the raters’ markings in 85 per cent of the items. 
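
The agreement figure reported above is a simple item-by-item percentage. As an illustration only, the short Python sketch below computes such a percentage for one pupil; the function name and the twenty yes-no responses are invented for the example and are not Myers' data.

```python
def percent_agreement(pupil_answers, rater_markings):
    """Per cent of items on which the pupil and the rater give the same response."""
    if len(pupil_answers) != len(rater_markings):
        raise ValueError("both lists must cover the same items")
    if not pupil_answers:
        raise ValueError("at least one item is required")
    matches = sum(p == r for p, r in zip(pupil_answers, rater_markings))
    return 100.0 * matches / len(pupil_answers)


# Hypothetical responses to 20 yes-no items (Y = yes, N = no).
pupil = list("YYNYNNYYYNYNYYNNYNYY")
rater = list("YYYYNNYYYYYNYYNYYNYY")  # differs from the pupil on three items

agreement = percent_agreement(pupil, rater)
print(f"Agreement: {agreement:.0f} per cent")
```

A figure of this kind shows only how often the two sources coincide; it does not show which of them is correct when they differ.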


PARENT-CHILD RELATIONSHIPS


Myers’ questionnaire is useful in evaluating not only interparental 
relationships but also several other aspects of a pupil’s family life. 
This brings us to the consideration of parent-child relationships.
Many studies have been made of the influence of various parent-child 
relationships on children’s adjustments. Symonds (16) has classified 
these relationships along two dimensions: dominance-submission 
and acceptance-rejection. A parent’s fluctuation between acceptance 
and rejection is called ambivalence; fluctuation between dominance 
and submission is called inconsistency. 

Fitz-Simons (4) has prepared a guide for estimating parent at- 
titudes to be used in evaluating case studies of parent-child relation- 
ships. After an elaborate case study has been made on the basis of 
many interviews with parents and observations of actual parental 
behavior and home situations, the guide can be applied so as to 
provide a quantitative index of the degree to which a parental 
attitude toward a child is negative or positive. Interpretation of the 
items of parental behavior as indications of acceptance or rejection 
can be made only in the light of a set of “Key-points” furnished by 
Fitz-Simons. Whenever the majority of the items recorded in the 
body of the scale fall on the positive side but the Key-points indicate 
that the parent’s attitude is essentially negative, that attitude may be 
considered overprotection or concealed rejection. The Key-points 
provide an index of the emotional flavor of liking or disliking in 
which the parental behavior is immersed. In interviewing parents 
and pupils, during visits to the pupil’s home, and in talks with the 
pupil at school, the teacher can use the concepts provided by Fitz- 
Simons as central points around which to arrange an evaluation of 
the parents’ relationship to the child. 

The child’s attitude toward his parents may similarly be de- 
termined through interviews. As usual, rapport must be established 
in the friendliest, most mutually respectful terms possible if the 
interview is to have maximum success. Simpson (13) used a direct 
approach among others, asking the child, “Whom do you like best 
at home?” Meltzer (9) has used what he terms the Single Free 
Association Interview. He met each child individually and engaged 
in a preliminary conversation to further rapport. The child was then 
instructed about the procedure as a game that would be fun and 
take only a little time. Thus the experimenter would say a key-word
and the child would “shoot out,” without holding back, the
first ten ideas or words that came to his mind in connection with 
the given word, regardless of whether they were silly or embarrassing.
After some preliminary trials with innocuous words the pupils
were given the word “mother” and then “father.”

By this technique Meltzer hoped to secure the fundamental under- 
lying attitude of the child toward his parents with a minimum of 
disguises and socially acceptable pretenses. As typical of the results 
obtained with his method, Meltzer presents the following: 


Fourteen-year-old Italian girl from low economic level school—I.Q. 
95. Father thirty-five years old; mother forty-four. Parents are Catholic 
and go to church regularly. Girl associates with family friends. Attended 
a settlement house and quit because considered children there too rough. 
Fairly popular in classroom. Cries easily if disappointed. Is leader in 
athletic activities in school. 


Reaction to Mother: Nice—gentle—kind—kind to animals—nice to 
people—polite—she’s not very good-looking—she likes comedians and 
she likes Russian music—she likes Polish music too,—she's not so nice 
sometimes (sometimes she has to scold you)—she doesn't like to go to 
shows—she doesn't care for nice clothes—likes to go riding (auto)—she 
likes to read stories. 

Reaction to Father: Rather good-looking—unselfish—likes to hunt and 
fish—likes to work—likes animals—likes dogs—he likes always to be out 
in the open air—he owns a clubhouse—he has picnics almost every 
Sunday—he likes to tinker with guns—he likes to give things—he likes 
to have good times—when he gets to scolding he scolds for rather a long 
time but that’s not so often.


From such reactions to the words “mother” and “father,” teachers 
can acquire insight into a pupil’s attitudes toward his parents, the 
degree of his dependence on them, the parents’ characteristic role 
in his world, and similar determiners of his feelings of emotional 
security and personal worth. 

Classroom teachers do not usually possess the training in clinical 
psychology necessary for the best application of Meltzer’s technique. 
It may prove serviceable when used with children with whom a high 
degree of rapport has been established. The interpretation of the 
results obtained will be necessarily qualitative rather than quantita- 
tive but may be sufficient when the teacher’s main purpose is to 
discover whether a pupil’s attitudes toward his parents are at the 
root of some difficulty in adjustment.

Stott (14) has constructed a questionnaire to measure for a group 
of rural pupils in Nebraska what he calls the “Family-Life Variable” 
or the pattern of confidence, affection, and companionability in the 
pupil's home life. Typical items in his questionnaire are given below.
The positive aspect of the Family-Life Variable is manifested by
adolescents who tend to say: 


1. They rarely assume the attitude expressed by “What my folks don’t
know won’t hurt them.”

2. They try out what their parents advise. 

3. They deserve the punishment they get. 

4. Every member of the family has “his say” in what the family does
as a group.

5. Their parents listen to their side when they disagree with the parents. 

6. Their parents sometimes admit that they are wrong. 

7. They like to do extra little things to please the other members of the 
family. 

8. They would enjoy being shut in with the family on a rainy day. 


The negative aspect of the variable is represented by pupils who 
admit: 


1. They think “Oh, what's the use” after trying to explain their con- 
duct to their parents. 

2. They “talk back” to mother. 

3. They get scolded for every little thing. 

4. Father resents it when they disagree with him. 

5. Other young folks seem to have more fun with their families than 
they. 

6. Their parents do things that make them appear foolish. 

7. Father nags and scolds. 

8. They have more fun away from home. 


Scores obtained by two children in the same family on a questionnaire
of 64 such items correlated with each other to only a very low
degree. This indicates that family life may be a different matter for
each of the children in a family, rather than a general objective 
home environment. That is, the responses of students to such questions
reveal their own personal adjustment to their parents and
describe the family situation as it exists for them, rather than an
absolute condition of the family as a whole. One of the children in a family may be well
adjusted and derive emotional satisfaction from his relationships
with his parents while another child in the same family may see
his family in a different light and be maladjusted in it. Thus it ap- 
pears that within the same general family situation there may be 
very different affective environments for different children in the 
family. 
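
The low relationship between siblings' scores can be expressed as a coefficient of correlation. The sketch below assumes the ordinary product-moment (Pearson) coefficient, which the text does not name explicitly, and uses invented scores for six pairs of siblings rather than Stott's data.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least two scores")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("scores in each list must vary")
    return sxy / sqrt(sxx * syy)


# Hypothetical questionnaire scores; each position is one family.
older_sibling = [52, 40, 47, 35, 58, 44]
younger_sibling = [41, 49, 38, 50, 45, 43]

r = pearson_r(older_sibling, younger_sibling)
print(f"Correlation between sibling scores: {r:.2f}")
```

A coefficient near zero would suggest, as in the study described, that the two children are reporting on different affective environments even though they live in the same home.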

Data concerning parent-child relationships must usually be evalu- 
ated informally by means of conversations with pupils and their 
parents and observations of the behavior of parents and children 
in such areas as have been indicated by the above illustrative items. 
Aggressive problem behavior of pupils may thus sometimes be 
traced to rejection by one or both parents. Excessively submissive 
or withdrawing behavior of pupils may similarly be traced to 
overaccepting or submissive parental attitudes toward them. In- 
consistent parental attitudes in either dominance-submissiveness or 
acceptance-rejection (ambivalence) also usually result in some form 
of child maladjustment. 


CHILD-TO-CHILD RELATIONSHIPS


Child-to-child relationships within the family frequently operate
in conjunction with parent-child relationships to produce the par- 
ticular pattern of adjustment or maladjustment exhibited by a pupil 
in school. Frequently, in investigating the parents’ attitudes toward 
a pupil, it will be found that some insight into the parents’ relation- 
ships to the other children in the family and of the latter to each 
other is necessary in order to make a given pupil's mode of adjust- 
ment more understandable. The relationships of the parents to the 
other children in the family can be evaluated by the techniques 
discussed above. The same general method of informal questioning
and observation may be used, either well concealed or based on 
thorough rapport between teacher and family. 

In general, it is not sufficient to ascertain merely that a pupil is an 
only or an intermediate or a youngest child. The factor of “only-ness,” 
or position in order of birth, has been found to have a less pre- 
dictable effect upon the personality of the individual than the more 
directly psychological relationships between children. As was pointed 
out in Chapter VII, such relationships between children as hero 
worship, shame, jealousy, and parent-child substitution may prove
to be fruitful hypotheses or "things to look for" in investigating the
relationship between children in the same family. 

Other types of relationships between siblings were studied by 
McFarland (8); these included direction, submission, resistance, 
imitation, rivalry, sympathy, protection, helping, giving and lend- 
ing, and affection. The term "direction" designates attempts by 
one child to control the activities of the other or to gain his submis- 
sion in the advancement of interests in which the other might or 
might not share. “Submission” is the acceptance of direction through 
either active adjustment or acquiescence. “Resistance” is the opposite 
of submission; it includes active rejection of direction, ignoring of 
direct commands, or opposition to the furthering of interests ad- 
vanced by the other child. Under “rivalry” are classified attempts 
on the part of one child to equal or excel the other in such ways as 
the possession of materials, physical size, skill, the attention or re- 
gard of another person. “Sympathy” is defined as concern by one 
child for the distress or discomfort of the other. “Protection” is the 
attempt by one child to shield the other or his belongings or in- 
terests from harm. “Helping” is considered as any effort on the part 
of one child to further the interests of the other when those in- 
terests are not shared by the first. “Giving” is considered as any 
incident in which one child places things that are in his present 
possession at the disposal of the other. Observable “affection” in- 
cludes both such physical advances as patting, hugging, and kissing, 
and such verbal expressions as calling the other child love names 
and making endearing statements. 

Whenever it is possible to observe two children of the same 
family together it may be possible to classify the relationship be- 
tween them in one or more of these categories and to draw in- 
ferences from this relationship regarding the adjustment of a pupil. 
Frequently, however, such inferences may be obtained merely from 
conversations with a pupil about brothers and sisters and from 
known facts concerning his educational achievement, physical status, 
mental abilities, emotional and social adjustment, and the attitudes 
of his brothers and sisters. If a pupil has a sibling who is markedly 
superior or inferior in one or more of these aspects, these facts may 
be inserted into the total picture of his personality when it is interpreted
as a basis for guidance. The pupil whose brother has
preceded him in school and been a star athlete and the valedictorian
of his high school class may, for example, easily set goals
for himself that are too high in the light of his own abilities. Feel- 
ings of inferiority and discouragement may result from such a rela- 
tionship. A pupil’s adjustment to the opposite sex may similarly 
be related to the presence or absence of siblings of the opposite sex 
in his own family. The boy who is surpassed by his younger brother 
in gaining academic honors or other forms of social approval; the 
girl with very masculine brothers who becomes a tomboy; the boy 
who expresses like for or disinterest in a vocation or curriculum be- 
cause of his older brother’s favorable or unfavorable experience 
with it—these relationships and many others between children in 
the same family may frequently provide the major clue to under- 
standing adjustment problems in providing guidance for an in- 
dividual pupil. The literature of psychology gives no hard and fast 
rules for the interpretation of such family relationships. Rather the 
teacher must use all his ingenuity, powers of observation, and ability 
to establish rapport in ascertaining the facts of these relationships 
and interpreting their unique meaning in the adjustment of the in- 
dividual pupil. 


SOCIO-ECONOMIC STATUS


The socio-economic status of a pupil’s family may be evaluated 
either on the basis of a single index or, more adequately, through the 
investigation of a larger number of carefully selected indexes. In
small communities or in rural schools, the socio-economic status 
may be so homogeneous and easily identifiable on the basis of the 
teacher’s general knowledge of the community that any formal 
investigation is unnecessary. When this is not the case, a teacher 
can ascertain this only by an explicit attempt to secure relevant 
information. The most frequently used single index of socio-eco- 
nomic status is father’s occupation. Various ways of classifying 
occupations have been used for this purpose. These classifications 
should be distinguished from those based upon grouping according 
to skills required, which is used in vocational guidance. Occupational 
classifications according to socio-economic status deal not with the 
types of abilities required, but with the social prestige, income, or 
general "goodness of living" which usually follows in the wake of a 
given paternal occupation. Beckman (2) has presented a classification
of occupations according to the intelligence, skill, and education
or training required as well as the socio-economic prestige given
to them. For each of the following five grades Beckman listed 
representative occupations of which we shall present only the first 
three: 


I. Unskilled manual occupations
    Farm laborers
    Lumbermen, raftsmen, and woodchoppers
    Laborers (construction, manufacturing, road, warehouse, etc.)
II. Semi-skilled occupations
    Fishermen and oystermen
    Mine operators
    Filers, grinders, buffers
IIIA. Skilled manual occupations
    Farm owners and tenants
    Apprentices to building and other skilled trades
    Bakers
IIIB. Skilled white-collar occupations
    Freight and express agents
    Mail clerks and carriers
    Radio, telegraph, and telephone operators
IVA. Subprofessional occupations
    Opticians
    Undertakers
    Actors and showmen
IVB. Business occupations
    Owners and proprietors of garages, truck, and cab companies
    Conductors (steam railroad)
    Postmasters
IVC. Minor supervisory occupations
    Farm managers and foremen
    Mine foremen and overseers
    Manufacturing foremen and overseers
VA. Professional occupations (linguistic)
    Authors, editors, and reporters
    Clergymen
    College presidents and professors
VB. Professional occupations (scientific)
    Architects
    Artists, sculptors, and teachers of art
    Chemists, assayers, and metallurgists
VC. Managerial and executive occupations
    Owners and managers of log and timber camps
    Mine operators, managers, and officials
    Manufacturing managers and officials, and manufacturers


It is evident that this ranking is by no means perfectly correlated 
with financial income. Rather, the cultural, aesthetic, and social 
prestige of the occupation is the major criterion. The income of 
many skilled workmen is greater than that of some professional 
workers such as clergymen or teachers; but the general social 
prestige, the cultural and aesthetic status, of these latter occupations 
is usually considered superior to that of skilled workmen. In general, 
the advantages enjoyed by pupils whose fathers rank high in this 
classification enable them to attain superior average status in educa- 
tional achievement and adjustment. 

Income as an index of socio-economic status must, of course, take 
into account the size of the family and the age and sex of its 
members. Sydenstricker (15) included these factors in his “Ammain”
scale. Each member of the family is assigned a number of
units based upon sex and age, and the sum of the units for all the 
members of the family is obtained. The monthly family income is 
then divided by this total to obtain the index of socio-economic 
status.
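
The arithmetic of such an income index may be sketched as follows. The unit values assigned by age and sex in this example are hypothetical placeholders, since the actual Sydenstricker values are not reproduced here; only the procedure (sum the units for all members, then divide the monthly income by the sum) follows the description above.

```python
def unit_weight(sex, age):
    """Units for one family member.

    HYPOTHETICAL values for illustration only; they are not the
    published "Ammain" unit table.
    """
    if age >= 18:
        return 1.0 if sex == "M" else 0.8
    if age >= 10:
        return 0.7
    return 0.4


def income_per_unit(monthly_income, members):
    """Monthly family income divided by the family's total units."""
    total_units = sum(unit_weight(sex, age) for sex, age in members)
    if total_units <= 0:
        raise ValueError("family must have at least one member")
    return monthly_income / total_units


# Invented family: father, mother, a boy of 12, and a girl of 6.
family = [("M", 40), ("F", 38), ("M", 12), ("F", 6)]
index = income_per_unit(145.0, family)
print(f"Income per unit: {index:.2f} dollars per month")
```

Two families with the same income thus receive different indexes when they differ in size or composition, which is the purpose of the adjustment.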

Other single indexes of socio-economic status are such possessions 
as a telephone, an automobile, a vacuum cleaner, or a bathtub. The 
inadequacy of these single indexes of the home environment for the 
understanding of an individual pupil’s background is self-evident; 
homes which are distinguished from one another by one of these 
factors almost inevitably overlap to a large degree with respect to 
their possession of other desirable environmental factors. For ex- 
ample, homes in which there is a telephone differ widely from 
one another in parental occupation, family income, cultural ad- 
vancement, and aesthetic sensitivity. Some of them may rank below 
homes without telephones in these various aspects of the home en- 
vironment. This same overlapping holds true for other single in- 
dexes; families with the same paternal occupation or family in- 
come index may differ widely from one another in other aspects of 
the home environment. 


In order to overcome the shortcomings of lack of precision and
inadequate coverage of the total scope of factors affecting a family's
socio-economic status, various methods employing multiple measures 
of environment have been devised. Of the many devices for this pur- 
pose, only the three most recent and highly refined scales will be 
considered here: 

1. The Minnesota Home Status Index 

2. The Sims Score Card for Socio-economic Status 

3. The Kerr-Remmers American Home Scale 

The Minnesota Home Status Index¹ requires an interview with
an adult in the household, that is, with one of the parents. Ques- 
tions are asked of him, and his responses are recorded on the inter- 
view blank. The index covers 50 items grouped into six classifica- 
tions: 

1. Children’s facilities

2. Economic status 

3. Cultural status 

4. Sociality status 

5. Occupational status 

6. Educational status

The scale was carefully constructed and evaluated by statistical
techniques. Its reliability was found to be .92 by the split-half 
method; this is sufficiently high for the measurement of individual 
homes. Its chief disadvantage is the fact that it requires that the 
interviewer visit the home; it can therefore be used to evaluate only 
one home at a time. Each interview will probably consume at least 
half an hour. When it is desirable to measure the socio-economic 
status of a large number of pupils, the burden of time and effort 
upon an individual teacher becomes so great as to render this 
method impracticable. But when the pupils are too young to be able 
to read and answer such paper and pencil questionnaires as are 
described below, the Minnesota Home Status Index is the best 
available technique. 
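
A split-half reliability of the kind reported above may be estimated roughly as in the following sketch. It assumes an odd-even division of the items and the Spearman-Brown correction to full length, neither of which is specified in the text, and the item scores are invented rather than taken from the Minnesota index.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Product-moment correlation between two equally long lists of scores."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        raise ValueError("half-scores must vary across homes")
    return sxy / sqrt(sxx * syy)


def split_half_reliability(item_scores):
    """Odd-even split-half reliability, stepped up by Spearman-Brown.

    item_scores holds one list of item scores per home.
    """
    first_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    second_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    r_half = pearson_r(first_half, second_half)
    if r_half <= -1.0:
        raise ValueError("correction is undefined for r = -1")
    return 2.0 * r_half / (1.0 + r_half)


# Hypothetical scores of six homes on eight items (1 = present, 0 = absent).
homes = [
    [1, 1, 1, 0, 1, 1, 0, 1],
    [0, 0, 1, 0, 0, 1, 0, 0],
    [1, 1, 1, 1, 1, 1, 1, 0],
    [0, 1, 0, 0, 0, 0, 1, 0],
    [1, 0, 1, 1, 1, 1, 0, 1],
    [0, 0, 0, 0, 1, 0, 0, 0],
]
reliability = split_half_reliability(homes)
print(f"Estimated split-half reliability: {reliability:.2f}")
```

The correction is applied because each half is only half as long as the full scale, and a shorter scale is ordinarily less reliable than a longer one.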

The Sims Score Card for Socio-economic Status? is a group socio- 
economic paper and pencil questionnaire applicable to the homes 
of children in the fourth grade or higher. The 23 questions are 
selected so that they can readily be answered by elementary school 


1Published by University of Minnesota Press, Minneapolis, Minnesota. 
? Published by Public School Publishing Company, Bloomington, Illinois. 


n 


438 How to Evaluate 


pupils and they cover various physical possessions of the home, 
social participation, and father's occupation. Reliability was found 
to be .9r by the split-half method; a further check on this was the 
correlation coefficient of .95 obtained between the scores of 100 pairs 
of siblings. This shows that the scores from responses made inde- 
pendently by two children in the same family tended to agree very 
closely. The validity of the scale was found to be adequate in terms 
of ability to discriminate between two extreme occupational groups, 
the professional and the day-labor groups. Similarly the scale dif- 
ferentiated very well between two selected neighborhoods known 
to be widely divergent in average socio-economic status. Despite 
its high reliability and its demonstrated validity for these rough 


types of discrimination, the Sims Scale has several disadvantages. ' 


In the first place, it samples too few aspects of the home environ- 
ment, these being mainly the cultural and economic aspects; the 
aesthetic desirability of the home is almost entirely neglected. In 
the second place, the scale is difficult to score because the responses 
are scattered over the page rather than uniformly placed in the 
margin. That a high degree of community of function exists be- 
tween the Sims Score Card and the Minnesota Home Status Index 
is indicated by the coefficient of correlation between them of .94, 
but the latter’s greater range of items and the six subscores which 
are obtainable with it provide a more analytical picture of the factors 
related to socio-economic status than the Sims Score Card does. 
The Kerr-Remmers American Home Scale? attempts to combine 
the advantages of the Sims group testing method with the compre- 
hensiveness and analytical values of the Minnesota Home Index. 
It contains 50 items answerable by any pupil in the sixth grade or 
above. The items are classified into cultural, aesthetic, economic, 
and miscellaneous sections. This classification was made by a type 
of factor analysis so as to obtain meaningful clusters of items while 
reducing the intercorrelations of the clusters to a minimum; mini- 
mizing the intercorrelations serves to make the scores on the four 
parts as independent as possible and to increase the diagnostic value 
of the scale. By using statistical analysis in conjunction with sub- 
jective opinion concerning the significance of each item, it was 
possible to insure that the actual interrelationships of the items, 
as well as their surface meaning, would influence the grouping. In
this respect the grouping is superior to that of Leahy, who relied
solely on the judgment of competent persons. Although the correla-
tion between the Kerr-Remmers Scale and the Sims Score Card was
found to be .92, the unique value of the former remains considerable
in the light of its broader range of items, its part scores for the
breakdown of general socio-economic status, and its greater scorabil-
ity by virtue of a more efficient arrangement of the items.

³ Published by Science Research Associates, 1700 Prairie Ave., Chicago, Illinois.

The correlation with American Home Scale scores of 60 pairs of 
siblings was .84. Reliability by the split-half method was found to 
be .89, and by the Kuder-Richardson formula .91. The correlation 
coefficient between scores on the Minnesota Home Status Index and 
the Kerr-Remmers Scale based on only 21 cases was .92, indicating 
substantial agreement between the individual and the group meth- 
ods of measuring socio-economic status. The following are typical 
items from the Kerr-Remmers scale: 


SECTION I. CULTURAL

2. Do either of your parents belong to a Parent-Teachers
organization? .......... YES NO
8. Are any of these magazines regularly taken in your
home? Which ones? Place a (√) check mark before
each magazine regularly taken in your home:

.... American Home        .... Look
.... American Magazine    .... McCall's
.... American Mercury     .... Saturday Evening Post
.... Argosy               .... Saturday Review of Literature
.... Asia                 .... School and Society
.... Life                 .... Scientific American
.... Liberty              .... Scientific Monthly
.... Living Age

(Eighty-five additional magazine titles are contained in this item.)

9. DID YOUR FATHER:
A. Ever attend school? .......... YES NO
B. Enter high school? .......... YES NO
C. Finish high school? .......... YES NO
D. Enter college? .......... YES NO
E. Finish college? .......... YES NO
11. Does your family employ any hired help such as a cook,
maid, or [illegible]? .......... YES NO
13. To how many daily newspapers does your family
subscribe? .......... 0, 1, 2, 3, 4, 5, 6, 7

SECTION II. AESTHETIC

14. Are there any trees in front of your home? .......... YES NO
17. Does your family own (not rent) the home in which
you live? .......... YES NO
19. Does your family keep a book or album of family
photographs? .......... YES NO

SECTION III. ECONOMIC

DO YOU HAVE IN YOUR HOME:
[illegible]. Vacuum cleaner? .......... YES NO
26. Washing machine? (Answer "YES" if your family
sends its laundry outside the home to be done) .......... YES NO
29. Does your family have an automobile? .......... YES NO
34. What is your father's usual occupation? WRITE OUT
exactly what he does for a living, such as farmer, tool-
maker, die-maker, paperhanger, truck driver, grocery
clerk, WPA, bakery-oven attendant, carpenter, etc.

.......... (please fill in)

SECTION IV. MISCELLANEOUS

36. Has your father been a member of any study, art,
service, or political club as Chamber of Commerce,
Rotary, Jacksonian Club, Lincoln Club, etc.? .......... YES NO
42. Have you had paid lessons in dancing, dramatics, ex-
pression, elocution, art, or music outside of school? .......... YES NO
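
The split-half and Kuder-Richardson coefficients reported above for the American Home Scale can be illustrated with a brief computational sketch in Python. The sketch is not the authors' own procedure. It assumes that each item is scored 1 (YES) or 0 (NO), that the split is into odd- and even-numbered items, that the half-test correlation is stepped up by the Spearman-Brown formula, and that the Kuder-Richardson coefficient is computed by formula 20 with the population variance of the total scores. The function names and the five-pupil data set are hypothetical.

```python
from statistics import mean, pvariance


def pearson(x, y):
    """Product-moment correlation between two equally long score lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def split_half_reliability(responses):
    """Odd-even split-half reliability, stepped up by Spearman-Brown.

    `responses` holds one row per pupil; each row is a list of 0/1 item scores.
    """
    odd_half = [sum(row[0::2]) for row in responses]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in responses]  # items 2, 4, 6, ...
    r_half = pearson(odd_half, even_half)
    return 2 * r_half / (1 + r_half)


def kr20(responses):
    """Kuder-Richardson formula 20 for items scored 0 or 1."""
    k = len(responses[0])
    totals = [sum(row) for row in responses]
    sum_pq = 0.0
    for i in range(k):
        p = mean([row[i] for row in responses])  # proportion answering YES
        sum_pq += p * (1 - p)
    return k / (k - 1) * (1 - sum_pq / pvariance(totals))


if __name__ == "__main__":
    # Hypothetical answers of five pupils to four YES/NO items.
    pupils = [
        [0, 0, 0, 0],
        [1, 0, 0, 0],
        [1, 1, 0, 0],
        [1, 1, 1, 0],
        [1, 1, 1, 1],
    ]
    print(round(split_half_reliability(pupils), 2))
    print(round(kr20(pupils), 2))
```

With so few pupils and items the coefficients mean little in themselves; the same functions would be applied to the full set of answer sheets from a class or school.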


For the rapid and efficient evaluation of the socio-economic status 
of large numbers of pupils in the upper elementary and secondary 
school grades, the Kerr-Remmers American Home Scale constitutes 
a highly practicable technique. Although the scale has been stand-
ardized only on urban homes, it is probably applicable to rural
homes with few if any modifications. It will, however, probably be
necessary to provide norms for rural homes before the scale can be
used with them. Similarly, the value of the particular items in any
instrument that measures socio-economic status may vary from time 
to time as social customs and technological facilities change. Fire- 
places in the home may eventually become signs of cultural retarda- 
tion rather than cultural advantage. Magazines may keep their 
names but change in content so as to affect their cultural value. 
Similarly the value of such items as “Central heating system?” or 
“Does your family leave the city every year for a vacation?” may 
vary widely in different geographic sections of the country. Such 
considerations, however, reduce the meaningfulness of the scores 
obtained to only a slight degree. Within ordinary limits of time and 
geography, the scale provides evaluations of real significance.


THE COMMUNITY


Thorndike (17) has provided a list of items by which the “general
goodness of life for good people” in American cities may be 
evaluated. A reading of his engrossing and enlightening little book, 
Your City, is necessary for a more complete and critical understand- 
ing of his technique; here, however, it is only possible to describe 
his general approach and to make available the “Ten-item City 
Yard-stick” for measuring the general goodness of one’s own city. 
In the first place, Thorndike demonstrates that American cities 
differ from one another widely in such factors as retention of boys 
and girls in school, number of domestic installations of gas per 100 
inhabitants in 1930, number of telephones per 100 inhabitants in
1930, number of radios per 100 inhabitants in 1930, infant death
rate for 1926-1934, death rate from typhoid fever for six years near 
1930, and many other items which may be considered related to 
health, creature comforts, education, recreation, social decency, 
literacy, and similar factors. The full meaning of “general good- 
ness” of living in a community may, of course, be obtained only 
from the specific items which enter into Thorndike’s measurements. 

Thorndike has measured 310 cities with an index consisting of 
37 items carefully selected from an original group of 297. Selecting 
the 200 cities remaining after the exclusion of (1) residential suburbs 
and other cities adjoining much larger cities, (2) the cities of the old 
South, and (3) the giant cities, he found that “the goodness of life
in a city is explainable only in part (about one-fourth) by wealth
and income. ... The goodness of life in a city has deeper roots 
than its present wealth and income” (17 : 62). 

On the other hand, the personal qualities of the residents, such 
as the number of persons per 1000 inhabitants graduating from 
public high schools, or the percentage of literacy in the total popula- 
tion, or the per capita number of deaths from syphilis (reversed), 
explained about 43 per cent of the goodness of life in a city. That 
is, “cities are made better than others in this country primarily 
and chiefly by getting able and good people as residents. . . . 
The second important cause of welfare is income” (17 : 67). 

Of Thorndike’s 37 items, many cannot be obtained conveniently. 
In order to enable ordinary citizens to secure a fairly accurate
measure of the general goodness of their own city, Thorndike has 
provided the following Ten-item City Yard-stick. These ten items 
can be obtained for “almost any city in a few hours, and . . . will 
tell fairly well how the city stands in its general goodness.” 


TEN-ITEM CITY YARD-STICK 


Item 1. Get from the health officer of your city the infant death rate, 
that is, number of deaths per year of infants 1 to 365 days old per 1000 
live births. Subtract this number from 120, and multiply the result by 2. 

Item 2. Get from the city-treasurer the year’s expenditures for the
operation and maintenance of parks, playgrounds and other means of
recreation, that is, the figure he would report to the census authorities as
“Government-cost payments for operation and maintenance of the de-
partment of recreation.” Divide this amount by the estimated population
of the city, and take ten times the quotient expressed as dollars. For 
example, if the amount is $46,350.00 and the population is 60,000, the
quotient is $0.7725, and ten times it is 7.7 (or 8 to the nearest whole
number). 

Item 3. Get from the city-treasurer the estimated value of all the
city’s property in the form of schools, libraries, museums, parks and 
other recreational facilities. Divide this amount by the estimated popula- 
tion of the city; then multiply the result expressed in dollars by 1.25. 

Item 4. Get from the city-treasurer the total value of all public 
property (exclusive of streets and sewers), both that (such as schools, 
fire engines, and jails) used for municipal services, and that (such as 
water-works, docks and power plants) used for public-service enter-
prises. Get also the net public debt, subtract the latter from the former,
then divide by the population. Enter a credit of 1 for every $3 per capita
excess of property over debt. In case your city owes more than its 
public property is worth, enter the appropriate negative number. 

Item 5. Get from the city-treasurer or from the superintendent of 
schools the expenditures for the operation and maintenance of schools. 
This does not include capital outlays or payment of interest on school 
debts. Divide this amount by the population. Multiply the number of 
dollars in the quotient by 2. That is, enter a credit of 1 for every 50 cents 
per capita spent for teachers' salaries, books and supplies, heat, light and 
care of the schools, etc. 

Item 6. Get from the superintendent of schools the number of persons 
who graduated from senior high school during the year, and divide 
this number by the city's population. Multiply the quotient by 14141. 
This is equivalent to giving a credit of 10 for every 7 graduates per
10,000 population. 

Item 7. Get from the person in charge of the public library the circula- 
tion of books as he would report it to the American Library Association. 
Divide this number by the city’s population. Multiply the result by 5. 

Item 8. Get from the superintendent of schools the number of pupils 
in school who were aged 16 years 0 months to 17 years 11 months at the
date when the school enrollment was taken. Find what per cent this
number is of the estimated number of persons 16 years 0 months to
17 years 11 months living in the city at that date and give a credit of 
1 for each per cent. 

Item 9. Get from the superintendent of the telephone company the 
number of subscribers, or estimate the number by counting the names on 
30 pages taken at random from the phone book. Multiply the number 
of phones by 3000, and divide the product by the city’s population. That 
is, give a credit of 1 for every three phones per thousand population. 

Item 10. Get from the electric light company the number of homes 
that are supplied with electricity. Multiply by 200 and divide by the city’s 
population. That is, give a credit of 2 for each domestic installation of 
electricity per hundred population. 

Sum the ten entries to obtain your city’s total score. . . .

The Ten-item Yard-stick scores in 1930 for the cities over 30,000 run 
from about 300 to about 1000. The average was about 575; about 10 per 
cent were below 400 and about 10 per cent were above 750.

Some study of cities from 10,000 to 29,000 in population indicates that 
the following adjectives are appropriate for their scores in the 10-item 
City Yard-stick:

200-350. Far below the American standard
351-500. Inferior
501-650. Ordinary
651-800. Superior
801-950. In the class of Evanston, Glendale, Newton, Oakland, Springfield, Mass., Grand Rapids and the like
951 or more. Among the world's highest 1 per cent.
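
The arithmetic of the Yard-stick lends itself to a short computational sketch. The Python functions below cover only five of the ten items (1, 2, 5, 7, and 10), chosen because their multipliers can be read directly from the quoted directions, and they attach the adjectives quoted for cities of 10,000 to 29,000 population. All function names and all figures in the usage lines are hypothetical. Rounding the total to a whole number is an assumption, and totals below 200 are treated as falling outside the tabled range.

```python
def item1_infant_deaths(rate_per_1000_births):
    """Item 1: subtract the infant death rate from 120, then multiply by 2."""
    return (120 - rate_per_1000_births) * 2


def item2_recreation(expenditure_dollars, population):
    """Item 2: ten times the per capita dollars spent on recreation."""
    return 10 * expenditure_dollars / population


def item5_schools(expenditure_dollars, population):
    """Item 5: a credit of 1 for every 50 cents per capita spent on schools."""
    return 2 * expenditure_dollars / population


def item7_library(circulation, population):
    """Item 7: five times the library circulation per inhabitant."""
    return 5 * circulation / population


def item10_electricity(homes_supplied, population):
    """Item 10: a credit of 2 for each installation per hundred population."""
    return 200 * homes_supplied / population


# Adjectives quoted for cities of 10,000 to 29,000 population.
BANDS = [
    (200, 350, "Far below the American standard"),
    (351, 500, "Inferior"),
    (501, 650, "Ordinary"),
    (651, 800, "Superior"),
    (801, 950, "In the class of Evanston, Glendale, Newton, Oakland, "
               "Springfield, Mass., Grand Rapids and the like"),
]


def total_score(entries):
    """Sum the ten item entries to obtain the city's total score."""
    if len(entries) != 10:
        raise ValueError("the Yard-stick requires exactly ten entries")
    return sum(entries)


def describe(total):
    """Return the adjective for a total, or None if it lies below 200."""
    score = round(total)  # assumption: totals are read as whole numbers
    if score >= 951:
        return "Among the world's highest 1 per cent"
    for low, high, label in BANDS:
        if low <= score <= high:
            return label
    return None


if __name__ == "__main__":
    # Hypothetical city of 60,000 with an infant death rate of 60 per 1000.
    print(item1_infant_deaths(60))
    print(item2_recreation(30_000, 60_000))
    print(describe(575))
```

The remaining five entries would be computed by hand from the directions and passed, together with these, to `total_score`.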


The Ten-item City Yard-stick can provide teachers with a better 
idea than is otherwise obtainable of the kind of community in 
which they are teaching and in which their pupils live. Other notions 
of the general goodness of a community can be obtained by averaging 
the scores of large numbers of pupils on scales of socio-economic 
status. Its general nature is thus described in terms of averages of the 
units, components, or individual families of which it is composed. 
The community within a community which is represented by an
individual school or classroom may similarly be described by the 
average socio-economic status of the homes from which its pupils 
come. 

In becoming acquainted with the community, the neighborhood, 
or the broad segment of environment in which pupils live, the 
teacher may, of course, use all the informal approaches that are 
available to any socially sensitive citizen. Talking with one's neigh- 
bors, with the parents of pupils, with “the man on the street,” will 
provide valuable insights into the political, economic, and social 
structures of the community. The political corruption or decency 
of a town or city, the real centers of political power, whether these 
be in the hands of a local manufacturer, the Chamber of Com- 
merce, or any other non-governmental group, can thus be ascer- 
tained. The local economic structure, the dominant industry, the 
rising and declining branches of economic enterprise, the population 
trends and shifts, the transportation facilities can all readily be 
studied and evaluated by a teacher who is an alert member of the 
community. The local newspaper may be a valuable index to the 
attitudinal make-up of the community, its liberalism or conservatism. 
Its attitudes toward school policies and curricula should be closely 
followed by the teacher who is concerned with the whole pupil. 
Only by living in and with the community can he understand the 
major sources of pupil and parent attitudes. The religious composi-
tion of the community, whether it be Catholic, Protestant, Jewish, or
any other faith, must similarly be understood by the teacher as a 
part of the total picture of the socio-economic environment and 
background. 

In evaluating the community in which his pupils live the teacher 
should not disregard the total society, the world of which all 
communities and nations are a part. War and peace, economic 
prosperity and depression, political stability and chaos all have 
their effects on pupils and must be comprehended in the total 
picture of the pupil's environment and background. The reper-
cussions of world events on one's own community should be ob-
served and interpreted in relation to pupil adjustment. Especially
in communities of heterogeneous national origin or racial composi- 
tion should the effects of international relations be evident. 

Population movements within the nation, as exemplified in 
Steinbeck's novel, The Grapes of Wrath, may introduce further 
conflicts within a community. Migrants and their cultures may 
require special understanding by a teacher both for their own worth 
and for their effect upon and relationships with the more permanent 
members of the community. 

Similarly, class-consciousness within a community needs to be 
appreciated and evaluated if the social adjustment of pupils is to be 
fully evaluated. To what extent is there a feeling of difference, a 
social distance, between rich and poor? How homogeneous is the 
socio-economic status of a group of pupils in a given school and to 
what extent does class-consciousness facilitate or hamper the social 
interaction and adjustment of any pupil or group of pupils? 

Goodykoontz (6) has developed a brief manual for the use of 
parents and teachers in studying the community in which their 
school system functions. Valuable points of view, considerations, 
questions, and sources of answers are provided for use in seeking an 
understanding of the community. The topics considered are (1) size 
of the community, (2) location, (3) history, (4) the people, (5) mak-
ing a living, (6) community organization and government, (7) com-
munity health, (8) recreation and cultural opportunities, (9) hous-
ing, and (10) welfare services. The references, bibliography,
suggestions for investigation and discussion, and general descrip- 
tions given for each of these topics will provide invaluable help
to any teacher who wishes to understand and evaluate the com-
munity in which his pupils live as a basis for improving his under- 
standing and guidance of the pupils themselves. 

Other pamphlets in the Know Your School Series published by 
the United States Office of Education are entitled “Know Your 
Board of Education,” “Know Your Superintendent,” “Know Your 
School Principal,” “Know Your Teacher,” “Know Your School 
Child,” “Know How Your Schools Are Financed,” “Know Your 
State Educational Program,” and “Know Your School Library.” 
These pamphlets (obtainable from the Superintendent of Docu- 
ments, Washington, D.C., at prices ranging from five to ten cents 
each) provide brief but practicable suggestions for procedures to 
be used by parents or teachers in understanding and evaluating 
various aspects of the schools. 


SCHOOL ENVIRONMENT AND BACKGROUND


Various techniques have been presented for measuring state and 
local school systems. These techniques provide means by which the 
teacher or school administrator can evaluate the school system as a 
part of the pupil’s environment and background and as a factor in 
pupil adjustment. Probably the most adequate of these is the Na- 
tional Education Association’s self-survey plan for state school
systems. This plan provides a group of fifteen check lists for the 
following phases of state school programs: (1) attendance laws and 
their enforcement, (2) child labor laws, (3) pupil personnel and 
adjustment, (4) teacher training and supervision, (5) certification 
of teachers, (6) employment and contracts, (7) state salary laws, 
(8) professional organizations, (9) teacher retirement systems, 
(10) sources of school revenue, (11) apportionment of state school 
funds, (12) material equipment, (13) state board of education, 
(14) state department of education, (15) adult and higher educa-
tion. 

Noble (12) describes the techniques for measuring a local school 
system in terms of its historical background, administrative phases 
and programs, financial condition, instructional situation, and school 
plant. Individual teachers and school administrators may well join 
forces for such self-evaluation in rounding out the picture of the 
pupil’s environment as related to his adjustment.

The curriculum of the school should be examined for the extent 
to which it serves individual and social needs. Whether the subjects 
are designed to further the development of pupils as useful citizens 
of a democracy and are adjustive in the sense of being related to 
their problems should also be determined. The curricular aspect of 
the pupil’s school environment may frequently be a source of malad-
justment or, conversely, a potent means of furthering desirable ad-
justment.

Bruner (3) has published a set of criteria which have been used 
in evaluating about 40,000 courses of study. The following four 
groups of criteria are presented: (1) Philosophy—social philosophy, 
educational philosophy, and principles of learning; (2) Content— 
authenticity, utility, adequacy, significance, and organization; (3) Ac- 
tivities—pupil purposing, interests and needs, social values, reality, 
variety, approach, and culminating activity; and (4) Evaluation of 
pupils’ work—purpose, variety, validity, areas of growth, and inter-
pretation. Frederick (5) has compiled a valuable summary of the
literature on curriculum development and evaluation. 

Of all aspects of the school environment, the pupil’s teachers are 
probably the most crucial factors in adjustment. For this reason a 
separate chapter is devoted to teacher evaluation. When the teacher 
evaluates himself, he lays bare a significant factor in the pupil's 
school environment as well as his own virtues and shortcomings. 
To this aspect of the pupil, namely, his “teacher environment,” we 
turn in the following chapter. 


SUMMARY 


Personal relationships within the home can be evaluated through 
various informal channels such as interviews with parents, pupils, 
and neighbors. Observational and questionnaire techniques may 
also be used. The factor of rapport is crucial in determining the 
validity of these evaluations. Socio-economic status can be evaluated
by means of scales designed especially for this purpose. Thorndike’s 
Ten-item City Yard-stick provides a usable technique for deter- 
mining the general goodness of life in a city. Other less formalized 
techniques are also available for acquiring insight into the nature of 
community environment and background. For evaluating state 
and local school systems and curricula, various check lists may be
used. The most important aspect of school environment, the teacher,
is considered at greater length in the following chapter. 


QUESTIONS 


1. What is your answer to the argument that the teacher's concern 
with personal relationships within the home constitutes an undemo- 
cratic interference in private affairs, an invasion of the citizen's 
"castle"? 

2. If you were unable to establish rapport with a pupil's parents and yet 
strongly suspected that home relationships were contributing greatly 
to his maladjustment, where would you seek assistance in evaluating 
and improving his home environment? 

3. How could the evaluation of the community be used as an instruc- 
tional device in the social studies? Should inferences be drawn from 
it concerning effects on pupil health, education, and adjustment? 

4. In relating pupil adjustment to pupil environment, how can you
explain differences in adjustment between pupils who ostensibly have 
the same environment? 

5. Describe two pupils you have known whose environment and back- 
ground were widely divergent. What differences between these two 
would you attribute to the differences between their environments? 
Emphasize especially differences in adjustment and attitudes. 

6. Describe a case you have experienced of two pupils in the same family
who differed widely in various respects. Could their home environ- 
ment be considered the same for both? Explain. 


REFERENCES 


1. Baruch, Dorothy W., “A study of reported tension in interparental 
relationships as co-existent with behavior adjustment in young chil- 
dren,” Journal of Experimental Education, 6 : 187-204 (1937). 

2. Beckman, R. O., “A new scale for gauging occupational rank,” 
Personnel Journal, 13 : 225-233 (1934). 

3. Bruner, H. B., “Criteria for evaluating course of study material,” 
Teachers College Record, 39 : 107-120 (1937). 

4. Fitz-Simons, Marion J., Some Parent-Child Relationships as Shown 
in Clinical Case Studies, New York: Bureau of Publications, 
Teachers College, Columbia University, Contributions to Education, 
No. 643, 1935. 


5. Frederick, O. I., “Curriculum development,” Encyclopedia of Educa-
tional Research, New York: The Macmillan Company, 1941,
pp. 373-385.

6. Goodykoontz, Bess, Know Your Community as a Basis for Under-
standing the Schools’ Problems, Washington: U.S. Office of Educa-
tion, Federal Security Agency, Know Your School Series, Leaflet
No. 57, 1941.

7. Leahy, Alice M., The Measurement of Urban Home Environment,
Minneapolis: University of Minnesota Press, 1936.

8. McFarland, Margaret B., “Relationships between young sisters as
revealed in their overt responses,” Journal of Experimental Educa-
tion, 6 : 173-179 (1937).

9. Meltzer, H., “Sex differences in parental preference patterns,”
Character and Personality, 10 : 114-128 (1941).

10. Myers, T. R., Intrafamily Relationships and Pupil Adjustment, New
York: Bureau of Publications, Teachers College, Columbia Uni-
versity, Contributions to Education, No. 651, 1935.

11. National Education Association, Research Bulletin, Washington:
Research Division, National Education Association, March-May,
1930.

12. Noble, M. C. S., Practical Measurements for School Administrators,
Scranton: International Textbook Company, 1939.

13. Simpson, M., Parent Preferences of Young Children, New York:
Bureau of Publications, Teachers College, Columbia University,
Contributions to Education, No. 652, 1935.

14. Stott, L. H., “Parent-adolescent adjustment, its measurement and
significance,” Character and Personality, 10 : 140-150 (1941).

15. Sydenstricker, E., and King, W. I., “The measurement of the
relative economic status of families,” Quarterly Publications of the
American Statistical Association, 17 : 842-857 (1921).

16. Symonds, P. M., The Psychology of Parent-Child Relationships,
New York: D. Appleton-Century Company, Inc., 1939.

17. Thorndike, E. L., Your City, New York: Harcourt, Brace & Com-
pany, Inc., 1938.


CHAPTER XIX


The Teacher

“COMMON SENSE” AND THE SCIENCES OF PSYCHOLOGY AND PSYCHIATRY 
agree that the teacher is the most important factor in the school 
situation with respect to the adjustment of the pupil. This is true 
whether one conceives of the teacher’s functions in the narrow sense 
of teaching only subject matter or in the broader sense of the school’s 
concern with the whole pupil personality—physically, intellectually, 
emotionally, and socially. The pupil’s learning is conditioned not 
only by “how much the teacher knows” (subject matter) but also to 
a considerable extent by the personality traits of the teacher and 
the psychological relationships existing between teacher and pupils. 

Since the teacher plays so important a role in the determination 
of the various aspects of pupils with which this volume is concerned, 
any thoroughgoing evaluation of the pupil to secure data upon 
which to base his guidance must include evaluations of the teacher. 
In Chapter VII we saw that the background and environment of 
pupils includes the family, the community, and the school. In 
Chapter XVIII we presented procedures and techniques which may 
be used in the evaluation of all these factors except the teacher. The 
present chapter will discuss evaluation of the teacher considered as 
a part of this background and environment. That is, we are con-
cerned with this evaluation in so far as the teacher constitutes a 
major influence upon pupils and hence must be understood if a 
complete and thorough understanding of the pupil himself is to be 
achieved.

The importance of the teacher in pupil learning and adjustment 
derives from his position as the major channel for the transmission 
of the facts, knowledge, skills, attitudes, and ideas of our culture.
It is his function to change pupils in the direction of greater satura-
tion with the desirable elements of the culture in which they live.
This role endows teachers with an authority which increases their
power to mold pupil personality. Pupils spend more of their waking
hours with teachers than with any other adult, including parents.
They are probably more imitative of their teachers, more susceptible
to their influences for good or evil, than is true of any adult
other than their parents. Consequently, if a teacher is to under-
stand a pupil, he must understand himself and, if possible, the other
teachers under whose influence the pupil has come. It is to enable
teachers to acquire an understanding of their influences upon pupils 
that the present chapter is intended. 


TEACHER SELF-EVALUATION


How can teachers be evaluated? Our answers to this question, 
since this book is intended mainly for teachers, will be made pri- 
marily from the point of view of the teacher evaluating himself 
rather than being evaluated by school superintendents, principals, 
or supervisors. This point of view will mean that this evaluation 
becomes essentially a democratic process of self-supervision based 
on the assumption that the teacher is eager for self-improvement 
rather than that supervision from above is necessary to motivate 
him toward improving his role in guiding pupil development. 
Most of the techniques described below can be applied either by 
administrators and supervisors or by teachers themselves. The tech- 
niques presented have been more frequently used by administrators 
than by teachers. But the major emphasis in the discussion is on the
potentialities of these techniques for self-evaluation. 


APPROACHES TO TEACHER EVALUATION


The approaches to teacher evaluation may be of two kinds. One 
evaluates the merit of teachers on the basis of the changes they 
produce in pupils. This approach is simply a reflection of the 
primary aim of all educational endeavor, namely, to effect desirable 
changes in pupils. The merit of any other instrumentality of educa- 
tion, be it a textbook, a school building, a principal, a curriculum, or 
a teaching method, may similarly be evaluated in terms of its 
effectiveness in producing these changes. The other approach to
teacher evaluation proceeds in terms of any aspects of teachers which
may be considered related to their effectiveness in producing de- 
sirable changes in pupils. Among the aspects which have been con- 
sidered thus related are all those which have been discussed in terms 
of pupils in Part I of this volume. The teacher’s own scholastic 
achievement, physical aspects, general and special mental abilities, 
emotional and social adjustment, attitudes and interests, and socio- 
economic environment and background may all be evaluated when 
it is desired to estimate his effectiveness in producing desirable 
changes in pupils.


TEACHER EVALUATION BASED ON PUPIL CHANGES


General Nature.—Let us first consider in more detail the nature 
of the “pupil-changes” approach to the evaluation of teachers. This 
approach involves first of all some definition of the types of changes 
which are considered desirable for teachers to effect. These changes
may be defined in terms of any of the various aspects of pupils thus 
far discussed. Needless to say, the aspect which is most often con- 
sidered a criterion of teaching success is the pupils’ achievement of 
instructional objectives as these were presented in Chapter II. 
Typical of these objectives are ability to solve arithmetic problems, 
ability to read rapidly and with comprehension, and understanding 
of a particular topic in American history. When instructional ob- 
jectives are formulated so as to include particular attitudes toward 
activities, institutions, or other attitude objects, the teacher's success 
may similarly be evaluated in terms of desirable changes in pupils’ 
attitudes. Thus it has been shown (10, 11) that teachers who differ
in liberalism have significantly different effects on their pupils’ 
liberalism and information concerning contemporary affairs. 

Among the other aspects of pupils which may thus be considered 
are various traits of character and personality, themselves essen- 
tially attitude configurations, which may not be explicitly included 
in formal lists of instructional objectives but are nevertheless con- 
sidered desirable products of any teacher's influence upon his pupils. 
Obviously, such aspects as pupils’ physical traits, general mental 
ability, and socio-economic background are not usually expected to 
change appreciably under the teacher’s influence and consequently
are seldom used as criteria for the evaluation of teachers. Physical
health and the pupil's home environment may, however, be improved 
as the result of instruction in applied dietetics, home economics, and 
similar fields. The school or teacher could then reasonably be 
evaluated on the basis of the changes in diet, health, or aesthetics of 
the home at which instruction is aimed. Such an evaluation program 
has been outlined by Clark (6) in an attempt to determine whether 
the schools can improve the diet of a community. 

In any case, whatever aspect of pupils is selected as the one in 
which certain changes are desired as the result of beneficial in- 
fluences on the part of the teacher, the general scheme for teacher 
evaluation is to secure a measure of the status of the pupils before
they come under the teacher's influence and after they have been 
under it for some time. The difference between these two scores on
the status in the given aspect is taken as the measure of the change
in pupils during the time they have been under his influence. 
Arithmetic achievement, for example, can be measured with a 
standard arithmetic achievement test before and after the pupils 
have been taught by a given teacher. If the scores on the test greatly 
increase, it may be inferred that the teacher is capable and efficient 
in teaching arithmetic, while if only a small increase or a decrease 
has occurred the opposite conclusion is drawn. But certain cautions, 
to be discussed below, are necessary if these conclusions are to be 
valid.
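The before-and-after scheme just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the scores are invented, the function name `mean_gain` is hypothetical, and the figure it returns says nothing about the cautions discussed below.

```python
from statistics import mean


def mean_gain(before, after):
    """Return the average per-pupil change between two testings.

    `before` and `after` hold the same pupils' scores in the same order.
    """
    if len(before) != len(after):
        raise ValueError("each pupil needs a score from both testings")
    return mean(a - b for b, a in zip(before, after))


# Hypothetical arithmetic scores for five pupils of one teacher.
fall_scores = [31, 40, 28, 35, 44]
spring_scores = [39, 47, 30, 41, 52]

gain = mean_gain(fall_scores, spring_scores)
print(f"mean gain: {gain:.1f} points")
```

A large mean gain of this kind is the raw material of the pupil-changes approach; whether it may be credited to the teacher is a separate question.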

Advantages and Difficulties.—The major advantage of this approach
to the evaluation of teaching success is its unquestionable
validity, for there can be no argument against the thesis that the 
teacher who produces desirable changes in his pupils in the most 
aspects and to the greatest degree is the best teacher. If this ap- 
proach were readily available to school administrators and individual 
teachers, the problem of teacher evaluation might be considered 
solved.

In practice, however, this approach is so beset with disadvantages 
and difficulties that it is usually used only in experimental research 
as a criterion for more practical procedures and devices for teacher 
evaluation. The disadvantages should be discussed both for the 
light they throw upon the nature of this approach and for the insight
thus afforded into the problems of educational research. In
the first place, the use of the criterion of pupil changes raises the 
problems of defining what changes in pupils are desirable and of 
measuring such changes. These problems refer immediately to the 
what and the how of educational measurement and pupil evalua- 
tion. For attempts at their solution the reader is referred to the 
preceding chapters of this book. 

A second set of difficulties in evaluating through pupil changes
involves the problem of insuring that any differences or changes
observed are due solely to the influence of the teacher rather than 
to other influences. A group of pupils may achieve instructional ob- 
jectives to a greater or lesser degree as the result of any of the 
following factors other than the effectiveness of the teacher's in- 
fluence: 

1. The general mental ability of the pupils. 

2. Special mental abilities of pupils related to particular types 
of achievement. 

3. Past educational experiences, or the amount and quality 
of the instruction received in earlier grades. It is difficult 
to hold this factor constant solely by means of equating 
scores on an achievement test given before a specific 
teacher's influence becomes active. 

4. The instructional materials, textbooks, manuals, workbooks, 
maps, visual materials, and notebooks which the teacher is 
able to use in instructional activities.

5. The pupil’s socio-economic background and environment, 
especially the cultural level of the home and community in 
which he lives. 

6. The amount and quality of the supervisory assistance and 
leadership provided the teacher by his principal and super- 
intendent. 

7. The teaching load and extracurricular duties of the teacher.

8. The general attitude toward work that characterizes the
school as a whole.

9. The quality of instruction in other areas of the curriculum 
than the one for which a given teacher is responsible or 
in terms of which he is being evaluated. This is important 
because the influence of all of a pupil’s teachers and instruction
in all areas of the curriculum may and often do
have an effect upon the pupil's achievement of any specific 
set of instructional objectives. 

10. The pupils’ achievement of objectives other than those
evaluated by the test or other device that is used. The real
worth of a teacher's influence may be reflected not so much
in his pupils’ information and knowledge, which are most 
easily evaluated by available tests, as in their attitudes and 
emotional and social adjustments. 

It is because of the difficulty of taking into account all of these 
and other factors in pupil achievement that the approach to teacher 
evaluation in terms of desirable changes in pupils cannot readily be 
used either by school administrators or by teachers themselves for 
self-evaluation. Elaborate experiments (1) have been carried out in 
the attempt to evaluate teacher effectiveness in terms of pupil 
changes, but the results have been largely disappointing. The ac- 
complishment quotient (A.Q.) has often been used as a means of 
holding pupils’ mental ability constant while evaluating a teacher's
effectiveness in increasing his pupils’ achievement on such standard 
tests as the Stanford Achievement Test. Since accomplishment 
quotients are so meaningless as to be practically worthless for either 
individual or group measurement, it is not surprising that the 
results have not been found significantly correlated with estimates 
of teaching efficiency obtained elsewhere. It must be concluded 
that the approach to teacher evaluation through standardized achieve- 
ment tests of pupils has not yet reached a stage of development 
where it can be used with any confidence that worth-while results 
will be obtained. 
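For readers unfamiliar with the accomplishment quotient, the sketch below assumes its conventional definition, educational age divided by mental age and multiplied by 100. The function name and the ages are hypothetical and serve only to show the arithmetic.

```python
def accomplishment_quotient(educational_age_months, mental_age_months):
    """Conventional A.Q.: 100 * educational age / mental age (assumed definition)."""
    if mental_age_months <= 0:
        raise ValueError("mental age must be positive")
    return 100.0 * educational_age_months / mental_age_months


# Hypothetical pupil: educational age 10 yr 6 mo, mental age 11 yr 8 mo.
aq = accomplishment_quotient(126, 140)
print(f"A.Q. = {aq:.0f}")
```

Because the quotient is a ratio of two fallible scores, errors of measurement in both the achievement test and the mental ability test enter into it, which is one reason for treating such figures with caution.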

This does not mean that “before and after” evaluations of pupils 
have no value. Their great usefulness in measuring pupil growth 
must be distinguished from their much lower practical value as
indexes of teacher merit. Such evaluations can reveal very effectively 
the kinds and amounts of changes that have been produced in pupils 
over a period of time by all the change-producing agencies that 
affect them. The problem of tracing the individual causes of these 
changes and isolating the effect of the teacher from all other effects 
is far more difficult and cannot be solved by the routine use of 
“before and after” tests. 




EVALUATION OF ASPECTS OF TEACHERS


Types and Aspects of Teachers.—Because of the difficulties of 
evaluating teachers through changes in pupils measured by stand- 
ardized tests, a more frequent approach has been the evaluation of 
the aspects of teachers which are judged to have a relationship with 
desirable changes in pupils. These aspects may be classified in two 
ways according to the relative emphasis given them by two types 
of educators—those who stress the cognitive, intellectual factors in 
teaching success, and those who stress the non-cognitive, emotional 
factors. The first group includes the teacher’s mental ability and 
scholastic success, and the second includes his attitudes and interests, 
emotional and social adjustment, and background and environment, 
especially as these are related to the favorable development of pupils’ 
emotional and social adjustment and attitudes.

It is our view that this second group needs far greater emphasis 
than it has heretofore received. This is in accordance with the 
growing realization that the teacher’s ability to inculcate “subject 
matter” is less related to the happiness and total adjustment of 
pupils, as well as to their social effectiveness, than are his qualities 
as a stimulating, sympathetic, and understanding guide in emotional 
and social growth. The total individual, however, is our concern and
we shall therefore describe available procedures and devices for 
evaluating both of these classes of aspects. 

Cognitive Aspects of Teachers.—Among the criteria which may
be used to evaluate a teacher's mental ability and intellectual at- 
tainments in terms of both general culture and specific knowledge 
of teaching methods and subject matter may be included the following:

1. Grades in teacher training courses 

2. Amount and quality of scholarly publications 

3. Membership and participation in professional societies 

4. Out-of-school contacts—services to the community and state

5. Standardized objective tests of mental ability and achievement

The methods of applying each of these criteria are so straight- 
forward that they need be discussed here only briefly. Needless to 
say, the teacher-in-training or the teacher on the job usually realizes 
full well the implications concerning himself of the grades he has
received in teacher training studies, of his activity or lack of it in
producing scholarly publications, and of his contacts outside the 
school. The teacher who receives A or B grades throughout his 
educational career and especially in the courses in the teacher train- 
ing curricula will usually, other factors being equal, be considered 
superior. School administrators almost always give consideration to 
the scholastic record of any candidate for a teaching position. In self- 
evaluation, the teacher needs no instructions in interpreting his own 
record in these respects. The last of these criteria, however, may 
profitably be discussed at greater length because many teachers 
and teachers-in-training are unaware of some of the standardized 
tests which have been constructed for the purpose of enabling 
evaluations of the cognitive aspects of merit as a teacher. 

The mental ability of teachers may be evaluated with many of 
the tests of general mental ability that are used with college fresh- 
men and adults; they were discussed in greater detail in Chapter 
XIV. Among them are the Ohio State University Psychological 
Test and the American Council on Education Psychological Examination.
Similarly, their intellectual attainments can be determined
by means of standardized achievement tests similar to those
used in evaluating the achievement of secondary school and college 
students. At present probably the most adequate battery of tests for 
this purpose are the National Teacher Examinations (13), published 
by the National Committee on Teacher Examinations of the Ameri- 
can Council on Education. The following were included in the 
battery of “common examinations” given on March 14-15, 1941: 


Title of Examination                                  Time Required

Reasoning ..........................................   40 minutes
English Comprehension ..............................   40 minutes
English Expression .................................   40 minutes
General Culture ....................................  180 minutes
    Current Social Problems
    History and Social Studies
    Literature
    Science
    Fine Arts
    Mathematics
Professional Information ...........................  120 minutes
    Education and Social Policy
    Child Development and Educational Psychology
    Guidance, and Individual and Group Analysis
    General Principles and Methods of Teaching
Contemporary Affairs ...............................   60 minutes


For the evaluation of mastery of the subject matter to be taught, 
the following “optional examinations” were offered, elementary
school teachers taking only the first and high school teachers taking 
any two others: 


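Education in the Elementary School ................. 120 minutes
English Language and Literature ....................  90 minutes
Social Studies .....................................  90 minutes
Mathematics ........................................  90 minutes
Biological Sciences ................................  90 minutes
Physical Sciences ..................................  90 minutes
Spanish ............................................  90 minutes
French .............................................  90 minutes
German .............................................  90 minutes
Latin ..............................................  90 minutes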


These examinations are all of the short-answer, multiple-choice type
and are administered in two days, about six hours of actual testing
time being required each day.*

* Teachers and teachers-in-training who desire to take the National Teacher
Examinations, usually given in March, may obtain specific information from the
National Committee on Teacher Examinations, 15 Amsterdam Ave., New York. In
1942, the third year of the examinations, they were administered in 107 centers
throughout the nation. The examining fee for 1942 was $7.50 per examinee. This
fee includes the cost of reporting the test results to the school system, the examinee,
and the president or dean of the college from which the examinee obtained his
degree. Unless he specifically indicates that he does not wish a report sent to his
college, it will be sent.

Since these examinations are aimed at the group of aspects which
we have termed "cognitive," namely, the teacher's information or
knowledge about general and specific fields of human culture and
his ability to comprehend, reason with, and apply such information,
the National Committee on Teacher Examinations emphasizes
that the extremely important non-cognitive factors in teaching suc- 
cess are neglected by the examinations it provides. That is, however 
valuable may be the information concerning teachers that is pro- 
vided by the examinations—and their value is admittedly great—the 
fact remains that they do not approach such aspects as attitudes, 
emotional and social adjustment, and environment and background. 

Probably the best idea of the nature of the National Teacher 
Examinations can be obtained from an inspection of typical items 
in certain of the fields tested. Some of the items, quoted by Flanagan 
(7), are as follows:²


English Comprehension, Part III, Vocabulary 


84. preponderant 
1 determined. (5%, 2%) 
2 preparatory. (0%, 0%) 
3 advanced. (0%, 1%) 
4 thought out. (21%, 5%) 
5 predominant. (30%, 90%) 
omitted (16%, 2%); not reached (28%, 0%)


English Expression, Punctuation 


All the childrens art classes were represented in the exhibit, and the class
               7                                                           8
which had done the best work received a prize.
                           9
7. 1 No punctuation necessary. (10%, 1%) 
2 children’s. (56%, 84%) 
3 childrens’. (27%, 15%) 
omitted (5%, 0%); not reached (2%, 0%) 
8. 1 No punctuation necessary. (73%, 95%)
2 class, (18%, 4%)
omitted (7%, 1%); not reached (2%, 0%)
9. 1 No punctuation necessary. (65%, 94%)
2 work, (24%, 4%) 
omitted (9%, 2%); not reached (2%, 0%) 


² After each answer the percentage of all candidates selecting that particular
choice is given; the second figure is the percentage of the individuals with scores
in the highest 10 per cent who chose that answer. The correct answers are printed
in bold-face type.




General Culture, Part I, Current Social Problems 


24. Freedom of speech is guaranteed to the American people by the 
1 unwritten law. (6%, 1%) 
2 penal code. (0%, 0%) 
3 first amendment to the Constitution. (63%, 98%) 
4 Zenger decision. (2%, 0%) 
5 Declaration of Independence. (28%, 1%)
omitted (1%, 0%); not reached (0%, 0%)


Professional Information, Part I, Education and Social Policy

45. Substantial modifications in the high school curriculum have been
necessary since the World War to allow for
1 expansion of the college preparatory curriculum. (14%, 5%)
2 the presence in high school of many persons with slight
aptitude for book learning. (61%, 88%)
3 studies on a more difficult level, necessitated by better preparation
in the grades. (3%, 3%)
4 emphasis on cultural rather than vocational subjects because of
decreased employment opportunities. (5%, 2%)
5 consolidation of rural high schools. (10%, 2%)
omitted (7%, 0%); not reached (0%, 0%)


Professional Information, Part II, Child Development and Educa- 
tional Psychology 
19. If a child is continually a difficult behavior problem, the teacher
might best
1 refer his case to a competent authority for investigation.
(71%, 90%)
2 discuss the problem with other teachers. (9%, 5%)
3 isolate him from the rest of the class. (2%, 0%)
4 have him transferred to another class. (2%, 0%)
5 devise a more effective form of discipline. (16%, 5%)
omitted (1%, 0%); not reached (0%, 0%)


Professional Information, Part III, Guidance and Individual and
Group Analysis
18. A boy applying for entrance to a highly selective school attained a
percentile score equivalent to the 67th percentile of the previous
year’s entering class in that school. The school counselor is justified
in concluding that the boy
1 probably should not be admitted to the school. (8%, 8%)
2 if admitted, would probably fail in his work. (2%, 0%)
3 if admitted, would have to work harder than the average entering
student in order to avoid failure. (30%, 17%)
4 if admitted, would probably do slightly better than average
work. (36%, 70%)
5 if admitted, should be about 67th best in the entering class of
one thousand. (10%, 2%)
omitted (14%, 3%); not reached (0%, 0%)


Professional Information, Part IV, Secondary School Methods
38. To be successful, the “no-failure” policy adopted in some schools
requires
1 the individualization of programs of study. (60%, 80%)
2 decreasing attention to the superior students. (4%, 3%)
3 the abandonment of school grades. (20%, 12%)
4 emphasis on vocational education. (4%, 2%)
5 elimination of pupils who cannot do satisfactory work. (4%, 2%)
omitted (8%, 1%); not reached (0%, 0%)


Teacher Knowledge of Child Psychology.—It will be noted that 
the National Teacher Examination on Professional Information 
approaches to some extent the teacher’s understanding of pupils, 
of educational policies and procedures in dealing with the problems 
of individual pupils. Another instrument which perhaps provides a 
more thoroughgoing and intensive evaluation of the teacher’s professional
equipment in child and adolescent psychology is the Kelley-Perkins
How I Teach inventory (8).³ Its purpose is "to measure
what teachers know about the wants, needs, problems, develop- 
mental status, and incipient personality disturbances of children 
and adolescents” (8:17). The items in the test were based on 
(1) material in books concerned with child and adolescent psy- 
chology, and with educational, clinical, abnormal, experimental and 
social psychology, (2) case histories of problem children, (3) descrip- 
tions of teachers most liked and disliked, and (4) observations 
made during visits to classrooms. These items were classified into 
three groups: teaching practices, opinions, and factual results of 
experimental study. Their scoring was based upon the judgments of
ten recognized authorities in child, adolescent, educational, and
clinical psychology, and in mental hygiene, psychiatry, attitude measurement,
and personality.

³ Published by Educational Test Bureau, 720 Washington Ave. S. E., Minneapolis,
Minn.

Items upon which the judges disagreed among themselves, or to
which they objected on the grounds of ambiguity or irrelevance, or
on which there was too close an agreement between them and a 
preliminary sample of 84 grade and high school teachers, were 
eliminated. Thus the items were endowed with what might be 
called "curricular" validity on the basis of the judges' responses, 
and “statistical” validity on the basis of their efficiency in dis- 
criminating between authorities and average teachers on the job. 
The latter type seems particularly valuable in that "the items on
which teachers and judges agree completely would obviously be of 
no use in the inventory, since such items would not discriminate 
between teachers" (8:26). The inventory was further refined on 
the basis of the responses of a preliminary sample of 68 representa- 
tive grade and high school teachers. Items were rejected which did 
not discriminate significantly between teachers in the upper and 
lower 50 per cent in terms of total scores on the inventory. This
procedure served to insure that the inventory would be "internally 
consistent." 
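
The internal-consistency check described above can be illustrated with a minimal sketch. It assumes each item has already been scored 0 or 1 against the judges' key, splits the teachers into upper and lower halves on total score, and reports the difference in the proportion of each half answering in the keyed direction. The function name and the data are hypothetical, and no test of statistical significance is attempted here.

```python
def discrimination_indexes(item_scores):
    """Upper-half minus lower-half proportion of keyed responses, per item.

    `item_scores` holds one list per teacher; each list contains that
    teacher's 0/1 score on every item, in the same item order.
    """
    half = len(item_scores) // 2
    if half == 0:
        raise ValueError("at least two teachers are needed")
    # Rank teachers by total score, highest first.
    ranked = sorted(item_scores, key=sum, reverse=True)
    upper, lower = ranked[:half], ranked[-half:]
    n_items = len(ranked[0])
    return [
        sum(row[i] for row in upper) / half - sum(row[i] for row in lower) / half
        for i in range(n_items)
    ]


# Invented responses of six teachers to three items.
responses = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],
]
indexes = discrimination_indexes(responses)
for number, index in enumerate(indexes, start=1):
    print(f"item {number}: {index:+.2f}")
```

Items with an index near zero would be the candidates for rejection under a procedure of this kind.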

Furthermore, items which elicited markedly different average responses
from grade and high school teachers were eliminated or
revised, so that the inventory could be used with both groups 
without penalizing either one. A further evidence of validity was 
the significance of the difference obtained between teachers who 
were rated "plus" by their principals and those who were rated 
"minus." "Plus" teachers were those whom the school administrators 
considered to understand children best, whom they would choose 
as teachers for their own children; “minus” teachers were those 
whom the administrators considered to understand children least 
and to whom they would hesitate to send their own children. An- 
other evidence of validity was the significance of differences obtained 
between teachers in school systems known to emphasize the pro- 
gressive, mental hygiene point of view, and those in more traditional 
schools. 

Probably the most adequate idea of the nature of the How I Teach 
inventory is obtained from an inspection of the 30 items on practices
and opinions which were found to discriminate best between teachers
in the highest and lowest tenth, on the basis of total scores, of
the total sample of nearly 1000 teachers. These items are as follows:


Directions: Check each of the following actions or practices in terms 
of what your own practice is (or would be) in dealing with this problem 
or situation. For instance, if you judge the practice to be "decidedly 
good" write the number 5 after the practice. 

Use this scale: 1—decidedly harmful; 2—probably harmful; 3—doubt- 
ful value; 4—probably good; 5—decidedly good. 


1. Requiring an additional assignment from a pupil who misbehaves
in class. ......
2. Commending the high school pupil for not being interested
in having dates. ......
3. Threatening to punish the pupil who tells lies. ......
4. Expecting a pupil to be able to give adequate reasons for his
undesirable behavior. ......
5. Telling the child who masturbates that it leads to ruined [...]
[...] harder. ......
9. Denying pupils the privilege of talking without permission. ......
10. Lowering class grades for misconduct in class. ......
11. Expecting all children to conform to the standards of the
school at all times. ......
12. Making a child who misbehaves feel guilty and ashamed. ......
13. Telling a pupil that he can succeed in any type of work if he [...]
14. [...] older brother did. ......
15. Requiring the same standards of all pupils for a passing
grade. ......

Directions: Indicate your opinion of each of the following statements; 
that is, if you “strongly agree” write the number 1 after the statement. 
Use this scale: 1—strongly agree; 2—agree; 3—undecided; 4—disagree; 
5—strongly disagree. 
16. The best education for children of low intelligence is a little
less of the same kind of education planned for the more
intelligent. ......
17. Children outgrow their early emotional experiences, as they
do shoes and clothes. ......
18. Some pupils are just naturally stubborn. ......
19. It is better for a girl to be shy and timid than "boy crazy." ......
20. The first signs of delinquency in a pupil should be received
by a tightening of discipline and more restrictions. ......
21. The newer methods of education tend to standardize children's
behavior. ......
22. Each time a pupil lies his punishment should be increased. ......
23. If a teacher keeps school conditions exactly the same and
gives all pupils an equal opportunity to respond, she has done
[...] ......
24. If a child constantly performs for attention, the teacher
should see to it that he gets no attention. ......
25. Dishonesty is a more serious personality characteristic than
unsocialness. ......
26. The teacher's first responsibility in all cases of misconduct is
to locate and punish the offender. ......
27. "Failed because of lack of application" is a reasonable explanation
of a child's failure in arithmetic. ......
28. The best way to manage the aggressive pupil is to be equally
aggressive. ......
29. Most pupils need some of the natural meanness taken out
of them. ......
30. When a pupil obeys all the rules of the school, one can be
sure he is developing moral character. ......


The meaning of these items, which are most typical of all the 
items in Forms A and B of the inventory, is discussed by Kelley 
and Perkins as follows: 


These items which discriminate so decidedly between the teachers in
the upper and lower tenths of the total group suggest several implica- 
tions for the training of teachers. In general, the teachers in the lower 
group did not understand certain principles of child and adolescent de- 
velopment. These teachers do not realize that all behavior is an attempt 
to meet some need; they do not realize the nature of individual dif- 
ferences; they do not understand how to motivate children; they do not 
consider the effect of failure; they do not realize the importance of 
heterosexual adjustment; and they are unaware of the laws of learning 
and present knowledge relative to formal discipline. 

According to many of these items, the action of this lower group
would often result in making the child more insecure, in fostering
hatred, rivalry, and jealousy, and in increasing conflict and tension. 
The answers to many of the statements suggest the teachers’ own needs 
for achievement, dominance, and even self-punishment. Training in the 
principles of mental hygiene must emphasize the necessity of seeking 
the real causes of behavior rather than punishing the symptoms of 
maladjustment (8:53).


The How I Teach inventory should prove valuable to teachers 
and teachers-in-training who wish to evaluate their own under- 
standing of the principles of mental hygiene and of the teaching 
practices, opinions, and facts which psychologists consider to be 
related to the emotional and social adjustment of pupils. A teacher’s 
responses to this inventory thus reveal his fitness for the responsibility 
of dealing with youthful personalities in the classroom to an equal 
if not greater extent than does evaluation of his general mental 
ability, general cultural attainment, or mastery of subject matter. 

The ultimate validation of any such inventory will, of course, be 
in terms of changes in pupil adjustment. Thus, two groups of 
teachers might be chosen, one with high scores and the other with 
low scores on the inventory. The pupils of these two groups of 
teachers could then be evaluated in terms of emotional and social 
adjustment when they first came under the influence of the teachers 
and again after they had been under their influence for a semester 
or a year. If the pupils of the first group of teachers either improved 
or did not change in emotional and social adjustment, while the 
pupils of the second group of teachers deteriorated, this would be 
clear evidence that scores on the How I Teach inventory are indica- 
tions of the probable effects of teachers on pupil adjustment. It is to 
be hoped that this experiment, or one equivalent to it, will soon be 
carried out. The major difficulty will be in securing a sufficiently 
valid and respected measure of pupil adjustment. The procedures 
and devices described in Chapter XVI would probably be reasonably 
adequate for this purpose. Needless to say, the initial status of the 
pupils’ adjustment and all other factors affecting it while they were 
under the influence of the two groups of teachers would have to be 
held constant or controlled statistically. 

Attitudes Toward the Role of the Schools.—In addition to determining
attitudes concerning pupils’ emotional development and
behavior problems it is frequently desirable to ascertain the degree
to which teachers are aware of and in agreement with progressive 
ideas concerning the role of schools in modern society. An instru- 
ment entitled What Should Our Schools Do?* has been made 
available for this purpose. A series of 100 statements is presented to 
which the teacher or citizen indicates his attitude by underlining
agree or disagree. The statements cover the following seven broad
categories of school policy: 


1. Willingness to accept change in the local educational program and 
research or experimentation in various aspects of the local school 
situation. (21 items)

2. Readiness to accept an intellectually tolerant point of view, one of 
freedom from personal bias or prejudice. (17 items) 

3. Approval of the general idea of extending the scope of educational 
services in the areas not now generally accepted within the common 
school or public school system. (13 items) 

4. Desire to broaden the curriculum or to take the school out of the 
classroom and into the community life, and vice versa. (9 items) 

5. Willingness to reject the theory of formal discipline or to accept 
the proposal of breaking down the solid subjects. (22 items) 

6. Acceptance of the policy of individualizing the educational resources 
of the school or acceptance of the primary policy of education of 
personalities as against desiring to have mass instruction. (21 items) 

7. Acceptance of possible related consequences involved in a program 
of liberalizing the educational program. (8 items) 


The progressivism not only of teachers but also of others in the 
community, school board members and parents, may be evaluated. 
In this way the community environment of the school and its pupils 
may be revealed in relation to its favorableness for the school’s concern
with all important aspects of pupils.

The reliability of the questionnaire estimated by means of the 
correlation between halves corrected by the Spearman-Brown 
formula was found to be .91 on the basis of the responses of 340 
teachers, and also .91 for 360 parents. Percentile norms based on
1546 teachers and 1673 parents in twenty-three Pennsylvania public 
school systems are furnished in the manual for the test. 
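
The Spearman-Brown correction mentioned above estimates the reliability of the whole questionnaire from the correlation between its two halves. The sketch below applies the standard formula for a test doubled in length; the function names are hypothetical, and the half-test correlation is merely what the formula implies for a corrected value of .91, not a figure reported in the manual.

```python
def spearman_brown(r_half):
    """Estimated full-length reliability from the correlation between halves."""
    return 2.0 * r_half / (1.0 + r_half)


def half_test_correlation(r_full):
    """Inverse of the correction: the half-test correlation implied by r_full."""
    return r_full / (2.0 - r_full)


implied_half = half_test_correlation(0.91)
print(f"implied correlation between halves: {implied_half:.3f}")
print(f"corrected reliability: {spearman_brown(implied_half):.2f}")
```

The correction always raises the coefficient, since the whole test contains twice as many items as either half.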


* Published by the Bureau of Publications, Teachers College, Columbia University, 
New York. 




That a somewhat general factor of “educational enlightenment” 
may exist is indicated by the coefficient of correlation of .64 obtained⁵
between the scores of 170 college students of education on
the How I Teach inventory and the What Should Our Schools Do? 
questionnaire. Apparently students who are enlightened concerning 
pupil behavior and adjustment problems also tend to have more 
progressive ideas concerning the role of the schools. 
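
A coefficient of correlation such as the one just cited is computed from paired scores. The following is a minimal sketch of the product-moment formula; the function name is hypothetical and the six pairs of scores are invented, so the result bears no relation to the .64 reported above.

```python
from math import sqrt


def pearson_r(x, y):
    """Product-moment coefficient of correlation between paired scores."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two equally long lists of at least two scores")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("scores on each measure must vary")
    return sxy / sqrt(sxx * syy)


# Invented scores of six students on two inventories.
inventory_a = [52, 61, 47, 70, 58, 66]
inventory_b = [40, 55, 42, 63, 49, 51]
print(f"r = {pearson_r(inventory_a, inventory_b):.2f}")
```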

Ratings of Teachers by School Administrators.—A second method 
of evaluating aspects of teachers presumably related to teaching 
success is to have school administrators, superintendents, principals, 
or supervisors rate teachers by means of any one of the many available
teacher rating scales. These scales are usually not fitted for
self-evaluation as is evident from the decrease in the validity coefficient
from .55 to .20 obtained by Barr (1) with the Torgerson
Teacher Rating Scale when it was administered by a school administrator
and then by the teacher herself. The criterion of validity
in Barr's study was formed from a composite of two measures of 
pupil’s gains on the Stanford Achievement Test, a composite of 
seven rating scales applied by school administrators, and a composite 
of nine tests commonly associated with teaching success. Whether 
this criterion is itself valid depends of course on the validity and re- 
liability of the achievement tests, rating scales, and teacher trait tests 


„of which it is composed. 


Since the major purpose of the present chapter is to provide teach- 
ers with methods for self-evaluation, rating scales designed for use 
by school administrators are not highly relevant and may be dis- 
missed with a few brief general comments. The validity of these 
scales depends first of all upon the traits into which they analyze the 
components of teaching success, or those they list as significant for 
any consideration of teaching merit. Secondly, they depend for 
their validity upon the rater's skill, knowledge, and familiarity with 
these aspects of teachers. 

Some notion of the relative worth of various teacher rating scales
may be secured from the following coefficients of correlation, obtained
by Barr (1 : 131), between the criterion described above and the twenty
variables of which it was composed:


"In an unpublished study reported to the authors by Kenneth Davenport. 




TABLE 20. THE COEFFICIENT OF CORRELATION BETWEEN CRITERION Xe AND
EACH OF THE VARIABLES

(N = 66)

Measure of Teaching Ability                          Correlation with Xe
Strayer-Engelhardt Teacher Rating Scale                        ± .05
Michigan Teacher Rating Scale                                  ± .05
Gain in Raw Score, Stanford Achievement                        ± .06
Schutte Teacher Rating Scale                                   ± .06
Torgerson Teacher Rating Scale                            .55  ± .06
Pennsylvania Teacher Rating Scale                         .53  ± .06
Almy-Sorenson Teacher Rating Scale                        .50  ± .06
Giles Teacher Rating Scale                                .48  ± .06
Knight Aptitude Test                                      .43  ± .07
Psychological Examination                                 .36  ± .07
Torgerson Professional Information Test                   .35  ± .07
Social Intelligence Test                                  .35  ± .07
Gain in Accomplishment Quotient                           .35  ± .07
Personality Rating Scale                                  .26  ± .08
General Merit Rankings                                    .23  ± .08
Morris Trait Index L                                      .22  ± .08
Strong Vocational Interest Blank                          .20  ± .08
Torgerson Teacher Rating Scale (self-rating)              .20  ± .08
New Stanford Arithmetic Test                              .17  ± .08
Wood Health Scale                                         .04  ± .08


For a more detailed description of the measures Barr used and for 
references to publishers of the devices used, the reader should con- 
sult pages 79-84 of his book. 
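
The second figure in each row of Table 20 is not explained in the text. It appears consistent with the probable error of a correlation coefficient, which was customarily reported in this form; that reading is an assumption here, not a statement from Barr's book. A minimal sketch of the check, using the conventional formula P.E. = 0.6745(1 - r²)/√N and a function name chosen only for illustration:

```python
import math


def probable_error_of_r(r: float, n: int) -> float:
    """Conventional probable error of a Pearson r: 0.6745 * (1 - r^2) / sqrt(n)."""
    return 0.6745 * (1.0 - r * r) / math.sqrt(n)


# A few coefficients read from Table 20, with N = 66 as given in the table.
for r in (0.53, 0.43, 0.26, 0.04):
    print(f"r = {r:.2f}   P.E. = {probable_error_of_r(r, 66):.2f}")
```

Under this assumption the computed values round to the figures printed beside the same coefficients in the table.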

Ratings of Teachers by Pupils.—To an even greater extent than 
school administrators, pupils have an opportunity to observe a 
teacher in action. Consequently, teachers can be evaluated on the 
basis of the ratings their pupils give them on various aspects of 
teaching merit. Numerous rating scales for this purpose have been 
designed. Of them probably the most widely used is the Purdue 
Rating Scale for Instructors, a graphic rating scale covering ten 
aspects of teachers, each of which is presented in a manner illustrated 
by the following: 


Interest in Subject 


Always appears full of      Seems mildly interested      Subject seems irksome
his subject                                              to him


The ten aspects which pupils are asked to rate are: 

1. Interest in subject 

2. Sympathetic attitude toward students 
3. Fairness in grading 

4. Liberal and progressive attitude 

5. Presentation of subject matter 

6. Sense of proportion and humor 

7. Self-reliance and confidence 

8. Personal peculiarities 

9. Personal appearance 
10. Stimulating intellectual curiosity 

The unique value of pupils’ ratings of teachers stems from the 
directness of their approach to certain highly important factors in 
the learning situation—the attitudinal factors, or the emotionally 
toned feelings, opinions, and reactions. As one of the present authors 
elsewhere points out (15): “All psychologists, whether eclectic in 
their allegiance to theory or as members of one or another of the 
various schools of psychology, agree upon the importance of the 
affective or feeling components of learning. Whether these are dis- 
cussed under the varying currencies of psychological theory such as 
the law of effect, motivation, affective valence, or other similar 
headings, there is substantial agreement that children’s attitudes 
are of primary importance in the effective acquisition of knowledge, 
skill, interests, attitudes, ideals, etc., with which the school purports 
to concern itself. What the psychologists are agreed upon as sound
theory would also be substantiated by an adequate sampling of the 
common-sense judgment of parents and children.” Obviously this 
carries strong implications for the desirability of evaluating pupils’ 
attitudes toward teachers. 

Furthermore, the indirect objectives of instruction, or concomitant 
learnings, such as attitudes toward subjects, attitudes toward the 
ideas represented by a teacher, a pupil’s attitudes toward himself as 
well as toward the more immediate, direct instructional objectives 
are similarly affected by his reaction to his teacher's personality and
methods. Thus, just as achievement tests in the usual sense are used
to evaluate one group of effects of the teacher’s influence on pupils, 
so pupils’ ratings of teachers may be used to evaluate another, 
equally important group of effects of this influence. 

Arguments For and Against Pupil Ratings of Teachers—Before 
we proceed to describe the available devices for pupil ratings and 
the typical results obtained with them, we should consider some 
of the arguments which have been brought forth against and in 
favor of this procedure. In the first place, it has been argued that 
such ratings are undesirable because pupils are incompetent to 
judge the merit of either the process or the results of teaching, in- 
capable of distinguishing between bad and good teaching, and prone 
to judge what the teacher does rather than what he gets the pupil 
to do. This argument may be answered on the grounds that even 
if, as is doubtful, it states the truth, it is important to ascertain 
pupils' attitudes toward their teachers because they exist and exert 
a powerful influence on the effectiveness of instruction. The old 
adage, "You can lead a horse to water but you can't make him 
drink," applies here with peculiar force. 

In the second place, it is argued that attaching importance to 
pupil ratings commits the democratic fallacy of implying that that 
teaching is best which pleases the majority of pupils, and that 
teaching should be adjusted to achieve this end. This argument 
may be answered on the grounds that the best educational process is 
in essence democratic, and the use of pupil opinion makes possible 
a wholesome kind of cooperative effort to improve the learning situa- 
tion. 

In the third place, it is argued that pupils are inclined to make 
snap judgments which are consequently unreliable. But the available 
statistical evidence indicates that the average ratings of teachers by 
a group of pupils about equal in number to those in the average
classroom have a reliability as great as or greater than that of most
standardized achievement tests.

In the fourth place, it is argued that pupils' judgments of teachers
may be affected and distorted by such irrelevant factors as grades,
amount of work required by the teacher, the pupil's interest in the
subject, the difficulty of the subject, the preestablished reputation of
the teacher, the general attitude toward school, and a lack of
seriousness in making the ratings. It can, however, be answered that
correlational studies have shown little relationship between most of
these factors and ratings of teachers; in particular, pupils’ grades,
attitudes toward subjects, amount of work required by teachers, 
and general attitude toward school have been found to correlate 
to only a low degree, or not at all, with their ratings of teachers. 
It is more difficult to ascertain the effect of preestablished reputa- 
tion, but in so far as such an effect exists and influences present 
ratings, it also constitutes desirable evidence concerning a teacher. 
The lack of seriousness in making ratings would tend to lower 
the reliability of the ratings; however, since ratings have been 
found to be reliable, it may be assumed that pupils have in most 
investigations taken a serious attitude toward this assignment. In 
any case, it is possible to eliminate this factor by taking effective 
steps to establish rapport with pupils when the assignment is ex- 
plained to them. 

In the fifth place, it has been argued that pupil ratings tend to 
disrupt the morale of the teaching staff by arousing hostility, self- 
consciousness, discouragement, disrespect between colleagues, and 
attempts to cater to adverse pupil opinion through activities ir- 
relevant to good teaching. Wherever such a danger seems to be 
present, teachers should be permitted to keep their ratings strictly 
confidential rather than being required to submit them to their 
principals, supervisors, or other school personnel. This situation 
did not seem to appear in most of the published reports dealing 
with the problem. 

In the sixth place, it is argued that pupil rating tends to have a 
disruptive effect on the morale of pupils, that they may come to 
feel that they are the judges of the worth of teachers, curricula, and 
school activity. No evidence of this has been found in any of the 
rating schemes whose results have been published. Bowman (2) 
states, on the basis of several years of experience with having 
student teachers rated by their pupils, that the latter's morale is 
improved by the opportunity. 

Arguments in favor of pupil rating other than those mentioned
in refuting negative arguments are, in the first place, the inevitably
powerful effect of pupils’ opinions of teachers, in the form of gossip,
on pupils, teachers, and administrators which makes it advisable to 
recognize the influence, bring it into the open, and capitalize fully 
on its value. In the second place, a collection of pupil ratings enables 
a teacher to exercise a sort of self-supervision based on objective 
evidence obtained from those in the best position to observe him. 
In the third place, the rating of teachers by pupils is an eminently
practical procedure: it seldom costs more than two cents per rating,
requires only a few minutes of time, and fits easily into the regular 
routine of any classroom. 

Procedures for Obtaining Pupils’ Ratings of Teachers—The 
planning and execution of any program for the collection and in- 
terpretation of pupils’ opinions of teachers may be broken down 
into the following steps: 

1. The construction or selection of a suitable rating scale
2. The administration of the scale, or having the pupils record
   on it their opinions of and attitudes toward various aspects
   of the teacher
3. The interpretation of the ratings

Let us now consider each of these steps in order.

The same general rationale for the construction of rating devices
which was outlined in connection with the evaluation of adjustment
in Chapter XV applies to pupils’ ratings of teachers. These
principles can be illustrated in terms of some of the numerous
published, commercially available devices for this purpose. These will
be presented in the order of the grade levels for which they are
suited.

The Diagnostic Teacher-rating Scale constructed by Tschechtelin
and edited by Remmers (14) is designed for use by elementary school
children in Grades 4 to 8 in expressing their attitudes toward their
teachers. The scale is divided into two parts: a general survey
instrument, by means of which a rather generalized picture of their
attitudes can be obtained, and a diagnostic instrument with two
comparable forms, which provides detailed and specific information
regarding the strengths and weaknesses of teachers. In both the
general and diagnostic scales, seven different aspects of the teacher's
work are considered:


1. Liking for teacher
2. Ability to explain
3. Kindness, friendliness, and understanding
4. Fairness in grading
5. Discipline (keeping order with the children)
6. Amount of work required
7. Liking for lessons

The manner in which the ratings are obtained is illustrated by the
following samples of Form A from both the general and the diagnostic
scale.


A DIAGNOSTIC TEACHER-RATING SCALE

Sister M. Amatora Tschechtelin          Edited by H. H. Remmers

Form A

Name of school ________
What grade are you in? Encircle one: 4 5 6 7 8
Are you a boy or a girl? Encircle one: Boy Girl    How old are you? ____ years, ____ months.

Following are a number of questions about your teachers. Please answer them honestly.
Your teachers will never know how you have rated them. Do not write your name
on this sheet.

5 means "the best";
4 means "very good";
3 means "average" or "about as good as any teacher";
2 means "below average" or "less than for most teachers";
1 means "very poor" or "the worst."

I. How well do you like her?
IV. How fair is your teacher in grading?
V. How well does your teacher keep order with the children?




Form A

Read each statement; if it tells something true about your teacher place a plus sign
(+) in the proper square at the left.

I. Liking for Teacher

Teacher

1. Is the one I like the best.
2. Is humorous at times.
3. Keeps everything in the room neat.
4. [illegible]
5. Is not polite.
6. Always wears a frown.
7. Is too grouchy.

IV. Fairness in Grading

22. Always gives the grades earned.
23. Gives fair grades.
24. Is quite fair in grading.
25. Gives fair grades sometimes.
26. Gives the boys better grades.
27. Grades some children too low.
28. Never is fair in grading.

V. Discipline (Keeping order with the Children)

29. Always keeps good order in a cheerful way.
30. Keeps good order.
31. Does not act "bossy."
32. Is always on time.
33. Is too easy-going.
34. Has a quick temper.
35. Always finds fault with everything one does.


It will be seen that on the general scale the aspects of teachers 
are considered in terms of questions and the pupils are instructed to 
rate a given teacher on a five-point scale. On the diagnostic scale 
there are provided for each general aspect seven statements selected 
from several hundred such statements which were evaluated and 
scaled in accordance with the method developed by Thurstone for 
the construction of attitude scales (see Chapter XVII). Thus, the
diagnostic scale is made up of seven shorter scales with approximately
equal scale distances between the various diagnostic statements.
For Form B, the general scale is the same as that for
Form A; the statements in the diagnostic scale all differ in specific 
wording but are similar in significance to those in Form A. 

The method of administering the Diagnostic Teacher-rating Scale 
is indicated in the above sample. Two features of the administrative 
procedure should be noted: the plea for honesty in rating, and the 
guarantee of anonymity to the pupil. It will also be noted that space 
is provided for the simultaneous rating of four different teachers if 
this is desired. Furthermore, there is space for the pupil to add re- 
marks concerning aspects of the teacher which are not included 
in the rating scale. Much worth-while material concerning a 
teacher’s personality is frequently obtained by asking for such 
comments.

The reliability and validity of this scale have been studied by
Tschechtelin, Hipskind, and Remmers (15). Reliability was
determined by an adaptation of the “split-test” procedure. “For
thirty-one teachers in grades four to eight widely distributed
geographically the papers of each of the thirty-one teachers were
divided into two chance piles and correlated according to the
following schema:

Pupils 1 + 3 + 5 + 7 + . . .
Pupils 2 + 4 + 6 + 8 + . . . + N

(N is the total number of pupils who rated each teacher.) The
correlations between these chance halves were then stepped up by
the Spearman-Brown formula” (15:197-198). The reliabilities
ranged from .86 for "Amount of Work Required” to .96 for “Liking 
for the Teacher.” The reliabilities of the aspects of the diagnostic 
scale were obtained by the equivalent-forms method, that is, by 
correlating the scores obtained on Form A with those obtained on 
Form B. These ranged from .87 to .97. 
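
The chance-halves procedure described above can be sketched in a few lines. This is only an illustration under stated assumptions: the ratings are invented, the papers are assumed to be in chance order so that alternate papers form the two piles, and the function names are hypothetical. Each teacher's two half-class averages are correlated across teachers, and the result is stepped up with the Spearman-Brown formula for a test of doubled length, 2r/(1 + r).

```python
from statistics import mean


def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def spearman_brown(r_half):
    """Step a half-length reliability up to full length: 2r / (1 + r)."""
    return 2 * r_half / (1 + r_half)


def split_half_reliability(ratings_by_teacher):
    """Correlate odd-pile and even-pile class averages across teachers."""
    odd_means, even_means = [], []
    for ratings in ratings_by_teacher:
        odd_means.append(mean(ratings[0::2]))   # pupils 1, 3, 5, ...
        even_means.append(mean(ratings[1::2]))  # pupils 2, 4, 6, ...
    r_half = pearson_r(odd_means, even_means)
    return r_half, spearman_brown(r_half)


# Invented ratings (5 = best, 1 = worst) for four hypothetical teachers.
ratings_by_teacher = [
    [5, 4, 5, 4, 5, 5],
    [3, 3, 2, 3, 3, 2],
    [4, 4, 4, 3, 4, 4],
    [2, 1, 2, 2, 1, 2],
]
r_half, r_full = split_half_reliability(ratings_by_teacher)
print(f"half-class r = {r_half:.2f}, stepped-up reliability = {r_full:.2f}")
```

The stepped-up figure estimates the reliability of the average rating from the whole class, which is the quantity reported in the text.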

The validity of the measuring instrument may be considered from 
four different points of view. In the first place, since the instrument
is designed to measure the attitudes of pupils, it is sufficient to say
that to the extent that reliable measures are obtained they are also 
valid since we are concerned not with the characteristics that 
teachers actually possess in the sight of some omniscient judge, but 
with the characteristics they possess in the eyes of the children 
whom they teach. In the words of T. L. Kelley (9 : 9), “If competent 
judges appraise Individual A as being as much better than In- 
dividual B as Individual B is better than Individual C, then it is so, 
as there is no higher authority to appeal to." In the second place, 
the logic underlying its construction is another argument for the 
validity of the scale. In so far as verbalized opinions are measures 
of attitudes and the scale measures verbalized opinions, as it quite 
obviously does, then it must also measure attitudes. In the third 
place, evidence for the validity of the short scale may be inferred 
from the degree to which it differentiates between the teachers 
whom the pupils selected as the best teacher and the poorest teacher 
they had ever known. A complete twofold division with no over- 
lapping was found between the two distributions of scores. A fourth 
argument for validity rests upon the lowness of the intercorrelations 
obtained between the seven aspects of teachers measured by the 
diagnostic scale and the highness of the correlations obtained between 
the general and the diagnostic scales. The lowness of the
intercorrelation coefficients indicates that each of the attitudes measured
is unique and relatively independent of all the others. 

The Bryan-Yntema Rating Scale for the Evaluation of Student 
Reactions (5) is designed for use at the secondary school level and 
enables the pupil to record his opinion of the teacher in the form 
of answers to the following thirteen questions: 


1. What is your opinion concerning the sympathy shown by this
teacher?

2. What is your opinion concerning this teacher’s ability to maintain
discipline?

3. What is your opinion concerning his (or her) fairness in marking?

4. What is your opinion concerning the ability of this teacher to
explain things clearly?

5. What is your opinion of the ability this teacher has to make the
classes lively and interesting?

6. What is your opinion concerning the ability of this teacher to plan
classroom work and keep students busy in class?

7. What is your opinion concerning the extent to which this teacher
speaks in a lively manner with a clear and distinct voice?

8. What is your opinion concerning the pride this teacher takes in
his personal appearance?

9. What is your opinion concerning the value that this subject has
for you?

10. What is your opinion concerning the general (all-round) teaching
ability of this teacher?

11. On what question from 1 to 8 (omitting the last two) did you
give the lowest rating? Please state in a sentence or two why you
gave a low rating on this question. (Your writing will not be
recognized if you print your words or use a backhand slope.)

12. Please name one or two things you especially like about this
teacher. (Print or use backhand slope when answering.)

13. Is this teacher in the habit of doing something, not mentioned
above, that you do not like? If so, what is it?


The directions for administering the scale and the form in which 
the questions are put are illustrated by the following reproduction 
of the first page of the scale. 


RATING SCALE FOR THE EVALUATION OF STUDENT 
REACTIONS 


Following are ten questions which should be answered by you and 
the other students in this class. If you answer them frankly and hon- 
estly, the results will give your teacher information on how you feel 
about this course, the teacher, and the procedures used by the teacher. 
This information will help the teacher to adjust teaching procedures 
to the needs of students in the future. Your teacher will never know 
how you, as an individual, answered the questions presented below. 
This leaflet has been arranged in such a manner that you can answer 
all questions without revealing your name through your handwriting. 
Do not write your name or the name of the teacher on any of these 
sheets. They will be collected and thoroughly shuffled before they reach 
the hands of your teacher, who will remain in front of the room until 
after all questions have been answered.




EXAMPLE

1. WHAT IS YOUR OPINION CONCERNING THE SYMPATHY
SHOWN BY THIS TEACHER?

Excellent ....... Always kind, considerate, and friendly. Always
                  able to see and understand the student's viewpoint
                  when a question, problem, or difficulty arises.

Good ............ Nearly always kind, considerate, and friendly.
                  Nearly always able to see the student's point of
                  view.

Average ......... Generally kind, considerate, and friendly, but
                  every once in a while fails to see the student's
                  point of view.

Below average ... Tries to be kind and helpful but is often impatient,
                  grouchy, and sarcastic. Usually has difficulty
                  in seeing the student's side of a question.

Poor ............ Almost always is harsh, grouchy, fault finding,
                  and inconsiderate.

The circle around “average” in this example means that this “example
teacher” received a rating of average (generally kind, considerate and
friendly) on question 1.

REMEMBER: All teachers can’t be above "average" in ability. Most 
students are “average” students, most doctors are “average” doctors, 
and most teachers deserve a rating of average or close to average on 
nearly all questions. 

THIS SUGGESTION WILL HELP YOU. Try to avoid giving a 
high rating on all questions or a low rating on all questions. Do this: 
first, look over all the questions on which you are going to give a rating 
and then make your rating on the two questions on which this teacher 
deserves the highest rating; second, make your rating on the two ques- 
tions on which this teacher deserves the lowest rating; third, make your 
rating on the remaining six questions. 


According to the authors of the scale, the regular classroom teacher 
can in most instances obtain just as frank and honest ratings when 
he conducts the class during the rating as when he calls in another 
teacher to obtain them. 

In scoring the rating scales, the teacher assigns values of 1 to 
“excellent” ratings, 2 to “good,” 3 to “average,” 4 to “below average,” 
and 5 to “poor.” A summary sheet is prepared which enables 
tabulation of the frequency with which ratings on a single aspect 
are given at each of the various levels from “excellent” to “poor.”
These frequencies are then multiplied by the values assigned to
each level, the products are added, and the sum is divided by the
total number of raters to obtain an average rating for each aspect
of the teacher. Since, according to Bryan and Yntema, it is easier
for most people to think of averages in terms of a scale of 100, the
average ratings are transmuted to a scale from 60 for poor to 100
for excellent.
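
The scoring arithmetic just described may be illustrated as follows. The frequencies are invented, and the function names are hypothetical. The text gives only the end points of the transmuted scale (60 for poor, 100 for excellent); the straight-line conversion between them used here is an assumption and may not be the exact table Bryan and Yntema supply.

```python
# Values assigned to each level, as stated in the text.
RATING_VALUES = {
    "excellent": 1,
    "good": 2,
    "average": 3,
    "below average": 4,
    "poor": 5,
}


def average_rating(frequencies):
    """Multiply each frequency by its level value, add, divide by raters."""
    total_raters = sum(frequencies.values())
    weighted_sum = sum(
        RATING_VALUES[level] * count for level, count in frequencies.items()
    )
    return weighted_sum / total_raters


def transmute(avg):
    """Assumed linear conversion: 1 (excellent) -> 100, 5 (poor) -> 60."""
    return 100 - 10 * (avg - 1)


# Invented tabulation for one aspect, rated by a class of thirty pupils.
frequencies = {
    "excellent": 6,
    "good": 10,
    "average": 8,
    "below average": 4,
    "poor": 2,
}
avg = average_rating(frequencies)
print(f"average rating = {avg:.2f}, transmuted = {transmute(avg):.1f}")
```

Because 1 is the best rating on the raw scale, a lower raw average yields a higher transmuted score.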

The average ratings for the ten aspects of teachers evaluated by 
the scale should be interpreted in the light of the pupils’ responses 
to Questions 11, 12, and 13. 

A teacher can usually improve the general all-round attitude of 
pupils toward himself and his procedures simply by eliminating 
the one or two outstanding faults in his personality or procedures. 
Bryan (4) has shown that improvement in the one or two aspects 
most responsible for a teacher’s low rating with pupils will produce 
higher ratings not only for these aspects but for all others. Bryan 
and Yntema offer the following illustration of this point (5 : 20): 


For example, take Teacher IV, who received a very low rating in 
discipline (and on most other items) because she could not control the 
students who kept the classroom in a constant state of confusion. Be- 
cause of this lack of control, this teacher would naturally receive a low 
rating on item 8 (ability of teacher to keep students busy), item 7 
(ability of teacher to make classes interesting), item 5 (sympathy— 
such a state of affairs would tend to make a teacher irritable), item 2 
(ability to explain things clearly—how could the explanation be clear 
when the confusion prevented many of the students from hearing what 
was said?), and so forth. In the instance of this example (and many 
others could be given involving other items), the restoration of a state 
of good discipline in class would automatically produce beneficial results 
in other areas. 


These two authors also present interpretations of the ratings 
received by fifteen teachers as illustrations of these procedures. Of 
them, those for Teachers II and XII are reproduced here. The 
numbers in parentheses after some of the comments indicate the 
number of pupils who made each one. These comments were made
in answer to questions different from those in the rating scale 
described earlier in this section. They were: 

(1) Is there something, not mentioned above, that you especially like 
about this teacher? If so, what is it? (2) Is this teacher in the habit of
doing something, not mentioned above, that you do not like? If so, what
is it? (3) Please give one suggestion that you think may be helpful to 
this teacher. 


TEACHER II. Ratings given by 21 tenth-grade students 


Item   Average   Key
  1       99     Knowledge of subject
  2       96     Ability to explain clearly
  3       97     Fairness in marking
  4       94     Discipline
  5       98     Sympathy
  6       93     Amount of work teacher does
  7       95     Ability to make classes interesting
  8       97     Ability to plan work
  9       99     Voice
 10       97     General teaching ability


Favorable comments: Has a good attitude toward the class; seems to 
know how to make one remember; has a way of keeping the class in 
good order and quiet (3); has ability to explain clearly and keep student 
interested (3); allows a lot of fun and still gets work done; is very 
friendly; takes great interest in the students as individuals (2); has a 
fine sense of humor; can see both sides of a question. 

Unfavorable comments: None. 

Interpretation: The only benefit that this teacher can derive from the 
ratings of this class is the feeling of satisfaction and confidence that may 
result from knowing that his students feel that he is just about “perfect” 
and that they have no suggestions to offer for improvement. 


TEACHER XII. Ratings received from 30 tenth-, eleventh-, and 
twelfth-grade students 


Item   Average   Key
  1       86     Knowledge of subject
  2       80     Ability to explain clearly
  3       78     Fairness in marking
  4       84     Discipline
  5       85     Sympathy
  6       84     Amount of work teacher does
  7       85     Ability to make classes interesting
  8       85     Ability to plan work
  9       88     Voice
 10       83     General teaching ability




Favorable comments: Lets us express our opinions; plans interesting
trips; friendly and sympathetic (2); informal; usually jolly; easy to 
understand; likes to help other people (4); interesting talker; is very 
complimentary; hard worker. 

Unfavorable comments: Does not treat the students alike (6); marks 
unfairly; talks too much (9); gossips about other people (8); makes it 
her business when something happens to a student that does not con- 
cern her in the least; embarrasses students in front of others (2); gives 
very indefinite assignments; could be neater about her room—messy 
desk, dirty sink. 

Interpretation: The averages show that the confidence that the students
have for this teacher is definitely below average. .. . The first steps 
necessary in order to gain the confidence of the students seem to be 
clear. They are: Avoid those things which make the students think she 
shows partiality, and stop gossiping about the students and other 
people. 

The reliability coefficients of the average ratings obtained from thirty
pupils are estimated to range between .83 and .92. With larger
groups of pupils these coefficients would, of course, be increased,
so that the reliability of average ratings obtained from approximately
sixty students would be well above .90 for all the aspects rated. The
validity of the ratings is here again identified with their reliability,
for there is no higher authority concerning pupils' opinions of
teachers than the opinions themselves.
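
The effect of doubling the number of raters can be checked with the general Spearman-Brown formula, kr / (1 + (k - 1)r), where k is the factor by which the group is enlarged. That the estimate in the text rests on this formula is an assumption here, and the function name is hypothetical.

```python
def spearman_brown_prophecy(r, k):
    """Reliability expected when the number of raters is multiplied by k."""
    return k * r / (1 + (k - 1) * r)


# Reliabilities reported for thirty pupils, projected to sixty (k = 2).
for r_thirty in (0.83, 0.92):
    r_sixty = spearman_brown_prophecy(r_thirty, 2)
    print(f"{r_thirty:.2f} with 30 raters -> {r_sixty:.2f} with 60 raters")
```

On this reckoning the lowest coefficient, .83, rises to about .91 and the highest, .92, to about .96, which agrees with the statement that all would exceed .90.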

Although differences between the sexes, or between pupils in the 
upper and lower half of their class in school marks, are significant 
only infrequently in ratings of a given teacher, they do occur. If 
the pupils in such classifications are asked to mark their sheets with 
symbols such as X or F, separate tabulations can be made of the 
ratings by boys and girls or other groups. Thus the existence of 
differences between various groups in attitude can be revealed and 
interpreted for their diagnostic value. 
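
A separate tabulation of this kind amounts to grouping the sheets by the symbol marked on them and averaging within each group. The following sketch uses invented ratings on a single aspect; the symbols X and F follow the text, and the function name is hypothetical.

```python
from collections import defaultdict
from statistics import mean


def mean_rating_by_group(sheets):
    """sheets: (group_symbol, rating) pairs, one per pupil's sheet."""
    by_group = defaultdict(list)
    for symbol, rating in sheets:
        by_group[symbol].append(rating)
    return {symbol: mean(values) for symbol, values in by_group.items()}


# Invented ratings on one aspect (1 = excellent ... 5 = poor).
sheets = [("X", 2), ("X", 3), ("F", 1), ("F", 2), ("X", 4), ("F", 3)]
group_means = mean_rating_by_group(sheets)
print(group_means)
```

Whether a difference between two group averages is large enough to matter must still be judged against the number of pupils in each group.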

The Purdue Rating Scale for Instructors (3) has been widely 
used at both the high school and college levels. The brief description 
given earlier in this chapter need not be repeated here. Remmers 
(12) has made an extensive study of its usefulness at the college 
level. Among his findings with scales used by 115 instructors in 
293 classes were (1) that the maturity of students (year in college) 
has a negligible effect on average ratings, (2) that the halo effect
is unimportant when viewed from the standpoint of the reliability
of a single trait, (3) that students agree very closely on the relative 
values of the ten teaching personality traits given in the scale, and 
(4) that the really significant factors affecting ratings are not 
students’ grades, size of classes, or institution at which the ratings 
are made, but rather the traits of different instructors as judged by 
students. 

Ward, Remmers, and Schmalzried (16) studied the extent to which 
teachers can and do change their teaching behavior when they are
apprised of pupils’ attitudes, as obtained with the Purdue Rating 
Scale for Instructors, and when a training program to improve weak 
spots is undertaken. Forty student practice teachers in a high school 
were rated by their pupils about one month after they began teach- 
ing and again at the end of the semester. Between the two ratings 
the supervisor conferred with each student-teacher concerning the 
general standing and specific strengths and weaknesses revealed by 
the first rating. Each student-teacher knew that the ratings would 
be repeated at the end of the semester. 

The effect of the first ratings and conference was revealed by the 
differences between the two ratings. Only one of the forty student- 
teachers failed to gain in average rating on the ten traits. The average 
gain in all traits for the entire forty was highly significant. The 
greater gains were made in ratings for "self-reliance and confidence" 
and “sense of proportion and humor.” The diagnostic and remedial
value of the ratings was reflected both in the relatively greater gains
in the two traits in which student-teachers are probably most
deficient and in the general gains.


SUMMARY 


The teacher’s importance as a factor affecting the intellectual and 
emotional development of pupils is so great that no thoroughgoing
evaluation of the pupil’s environment can be made without evaluat- 
ing the pupil’s teacher. Evaluation of teachers by themselves rather 
than by school administrators is the focus of the present discussion. 
Teachers can be evaluated either on the basis of the changes they 
bring about in pupils or on the basis of those aspects of themselves 
that are presumably related to their effectiveness in bringing about 
desirable changes in pupils. The first basis, while of unquestionable
validity, is beset with so many difficulties as to be impracticable in 
most situations. The second may proceed in terms of either cognitive 
aspects or non-cognitive, attitudinal aspects. Various tests of mental 
ability and intellectual-cultural attainments are available for evaluat- 
ing cognitive aspects. Teachers’ attitudes and opinions concerning 
questions of child psychology and pupils’ emotional and social ad- 
justment may be evaluated with the How I Teach inventory. Other 
aspects of the teacher’s effect on pupils may be ascertained through 
the use of pupil ratings of teachers. The procedure for obtaining 
such ratings is extremely practicable and the results are highly 
reliable and valid.


QUESTIONS 


1. Design an experiment to determine the effect of anonymity versus 
requiring signatures on ratings of teachers by pupils. 

2. How would you determine whether attitudes toward specific teachers 
change with the age of pupils? That is, are teachers who were not 
liked by children liked when the children have grown up? 

3. Can teachers ever be too intelligent? 

4. If teachers influence the social-psychological attitudes of their pupils 
as investigators have shown, would you advocate testing the relative 
“liberalism” or “conservatism” of prospective teachers? If so, how and 
with respect to what kinds of issues? 

5. Arrange in descending order of importance the various aspects of 
teachers listed in this chapter. Defend your ranking. 

6. It has been found that the social studies teachers who believe most 
strongly that the “right” attitudes should be impressed upon pupils 
tend to have the least effect in this regard, whereas teachers who 
believe that pupils should draw their own conclusions are more 
successful in influencing pupils in the direction of their own attitudes. 
Can you explain this psychologically? 

7. Are the cognitive aspects of teachers related to their attitude
configurations? If so, how? If you are not sure, how could you
experimentally determine the answer?

8. Have ten of your acquaintances rate you on a teacher rating scale 
and study the degree of agreement between chance halves of the ten 
raters.

9. Ratings on different desirable traits are generally correlated positively;
i.e., high ratings on one trait tend to be associated with high ratings
on another trait. Some of this at least is “halo” effect. Can you think
of any way to separate this “halo” effect from the true psychological
overlap of such traits?
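As an aid to Question 8, the comparison of chance halves can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the function names, the 0-100 rating scale, and the ratings themselves are hypothetical and do not come from any study cited in this chapter. The raters are shuffled into two chance halves, each half's average rating on every trait is computed, and the two average profiles are correlated.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


def chance_half_agreement(ratings, seed=0):
    """Split the raters into two chance halves and correlate the halves'
    average trait profiles.

    ratings: one row per rater; each row holds that rater's scores on the
    same traits in the same order.
    """
    rng = random.Random(seed)          # fixed seed so the split is repeatable
    raters = list(range(len(ratings)))
    rng.shuffle(raters)
    half = len(raters) // 2
    first, second = raters[:half], raters[half:]
    n_traits = len(ratings[0])
    profile_a = [mean(ratings[r][t] for r in first) for t in range(n_traits)]
    profile_b = [mean(ratings[r][t] for r in second) for t in range(n_traits)]
    return pearson(profile_a, profile_b)


# Hypothetical data: 10 raters (rows) by 5 traits (columns), 0-100 scale.
example_ratings = [
    [80, 65, 90, 70, 55],
    [75, 60, 85, 72, 50],
    [82, 70, 88, 68, 60],
    [78, 62, 92, 75, 58],
    [85, 68, 87, 66, 52],
    [79, 64, 91, 71, 57],
    [81, 66, 86, 69, 54],
    [77, 61, 89, 73, 59],
    [83, 67, 93, 67, 53],
    [76, 63, 84, 74, 56],
]
agreement = chance_half_agreement(example_ratings, seed=1)
print(round(agreement, 2))
```

Repeating the split with several different seeds gives a rough idea of how much the agreement figure depends on which raters happen to fall in each half.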


REFERENCES 


1. Barr, A. S., and others, “The validity of certain instruments
   employed in the measurement of teaching ability,” in Lancelot, W. H.,
   and others, The Measurement of Teaching Efficiency, New York:
   The Macmillan Company, 1935.

2. Bowman, E. C., “Pupil ratings of student-teachers,” Educational
   Administration and Supervision, 20 : 141-147 (1934).

3. Brandenburg, G. C., and Remmers, H. H., The Purdue Rating Scale
   for Instructors, Lafayette, Indiana: Lafayette Printing Co., 1928.

4. Bryan, R. C., “Pupil rating of secondary school teachers,” School
   Review, 44 : 357-368 (1938).

5. Bryan, R. C., and Yntema, O., A Manual on the Evaluation of
   Student Reactions in Secondary Schools, Kalamazoo, Mich.: Western
   State Teachers College, 1939.

6. Clark, H. F., “An effort to extend the measurement of the results
   of schooling into the social and economic fields,” American
   Educational Research Association, An Appraisal of Technics of
   Evaluation: Symposium, pp. 20-24, February 26, 1940.

7. Flanagan, J. C., “An analysis of the results from the first annual
   edition of the National Teacher Examinations,” Journal of
   Experimental Education, 9 : 237-250 (1941).

8. Kelley, Ida B., and Perkins, Keith J., “An investigation of teachers’
   knowledge of and attitudes toward child and adolescent behavior
   in everyday school situations,” Studies in Higher Education XLII,
   Bulletin of Purdue University, 1941.

9. Kelley, T. L., The Influence of Nurture upon Individual Differences,
   New York: The Macmillan Company, 1926.

10. Kroll, A., “The teacher’s influence upon the social attitudes of boys
    in the twelfth grade,” Journal of Educational Psychology, 25 :
    274-280 (1934).

11. Mason, H. M., “Effects of high school social studies teachers’
    attitudes upon attitudes of their pupils,” Studies in Higher
    Education XLIV, Bulletin of Purdue University, 1942.

12. Remmers, H. H., “The college professor as the student sees him,”
    Studies in Higher Education XI, Bulletin of Purdue University,
    1929.

13. Ryans, D. G., “Measuring the intellectual and cultural backgrounds
    of teaching candidates; an analysis of the results of the second
    annual administration of the National Teacher Examinations,” New
    York: Co-operative Test Service, Publications in Measurement and
    Guidance, Series N.T.E., Vol. 1, No. 1, August, 1941.

14. Tschechtelin, Sister M. Amatora, and Remmers, H. H., Diagnostic
    Teacher-rating Scale, Lafayette, Indiana: Division of Educational
    Reference, Purdue University, 1940.

15. Tschechtelin, Sister M. Amatora, Hipskind, Sister M. John Frances,
    and Remmers, H. H., “Measuring the attitudes of elementary-school
    children toward their teachers,” Journal of Educational Psychology,
    31 : 195-203 (1940).

16. Ward, W. D., Remmers, H. H., and Schmalzried, N. T., “The
    training of teacher-personality by means of student-ratings,” School
    and Society, 53 : 189-193 (1941).


CHAPTER XX 


Administering the Evaluation 
Program 


WE HAVE THUS FAR BEEN CONCERNED WITH THE PURPOSES, CONTENT, AND 
construction or selection of evaluation procedures and devices. In 
Chapter I our thesis that the major function of evaluation is to 
provide data for guidance was stated and placed in relationship to 
several other uses of tests and measurements. In Chapters II to VII
we discussed the question, What aspects of the pupil should be 
evaluated as a basis for guidance? The nature of each of these aspects, 
its importance to guidance, and its relationship to other aspects were 
all discussed. Thus far in Part II we have been concerned with the 
construction and selection of procedures and devices for the evalua- 
tion of each of these aspects of pupils. 

After a method of evaluation has been decided upon, the device 
or technique which it involves must be administered. The present 
chapter is concerned with the procedures for administering evalua- 
tion devices. Among the topics to be discussed are (1) the establish- 
ment of rapport between teacher and pupil, (2) the frequency with 
which various types of evaluation devices should be administered, 
(3) a general overall schedule for evaluations throughout a pupil's 
school career, (4) the environmental conditions under which evaluations
should be made, (5) the importance of and procedures for
strict adherence to standardized conditions of administration, and
(6) the allowable types of variation, such as oral administration and
open-book examinations.


Rapport 


Importance of Rapport—Rapport is a relationship between pupil
and teacher such that the pupil has confidence in and is willing to
cooperate with the teacher. Rapport is necessary, in varying degrees,
for the administration of all kinds of evaluation devices. Without
it, emotional tension and inhibition will appear and prevent the
device from giving a true picture of the particular aspect of the
pupil being evaluated. In administering achievement tests the
absence of rapport will mean “nervousness” and inefficiency on the
part of the pupil. He will not be able to do full justice to his own
achievement. He may become so tense as to look for absurd subtleties
in the directions for the test; his ability to recall or recognize things
which he knows well may become “blocked” and vanish into
thin air.

In evaluating physical aspects of pupils, inadequate rapport has
been found by physicians to result in increased pulse rate and blood
pressure; pupils may attempt to deceive teachers concerning their
posture, their health habits, or their general state of physical
well-being. General and special mental ability similarly cannot be
expressed in valid fashion if examiners do not establish rapport before
giving the test. Its importance in administering a self-inventory,
a vocational interest questionnaire, or an attitude scale has already
been emphasized in the chapters dealing with these devices. It is
evident that the necessity for rapport exists throughout the
evaluation program for all aspects of a pupil and for all pupils.

Securing Rapport—How can rapport be established and main- 
tained? No specific rules can be given. Procedures vary with pupils, 
situations, and the personality of the teacher. For the pupil who is 
too highly motivated, so eager to perform well on a test that he be- 
comes tense and inefficient, the testing situation should be de- 
emphasized and put in its proper perspective. The lackadaisical 
pupil who does not take the testing situation seriously and will 
not put forth his best efforts should be given some indication of the 
importance to himself of doing well on it. The discouraged pupil
who feels that his performance is inferior and has little hope of 
doing creditably on a test should be given encouragement in as 
subtle and friendly a way as possible; perhaps, in order to increase
his confidence, he should be given a goal lower than that which the 
teacher knows he can achieve. The overconfident pupil whose 
conceit leads him to exert less than his full energies may need to be 
offered comparison with higher standards to show him that he has
not exhausted the possibilities of achievement in a given field. 

The pupil who is fearful or ashamed of the truth about himself 
in any given aspect may need to be reassured that the evaluation, 
test score, or mode of adjustment will be considered a confidential 
private affair between himself and the teacher or counselor from 
whom he receives guidance. The value of self-knowledge and self- 
understanding as a basis for self-improvement and more adequate 
adjustment may need to be pointed out to him. The teacher should 
adopt a completely non-critical, “non-moral” attitude toward him 
when attempting to evaluate his emotional and social adjustment. 
Praise and censure should both be withheld. In some cases, how- 
ever, pupils who are ashamed of certain facts about themselves 
should be reassured; for example, they can be told how frequently the 
same facts are true about other pupils. 

In general, moderate encouragement and praise should be given 
to pupils when evaluating their status in intellectual or cognitive 
matters, while receptiveness and sympathetic understanding should 
keynote the teacher’s role in evaluating physical aspects, emotional 
and social adjustments, attitudes and interests, and environment and 
background. In evaluating the former the teacher should be a skilled 
“stimulator”; in the latter he should be a skilled “respondent,” 
neither praising nor blaming but simply understanding. The pupil 
should not be given to understand that in a certain respect he is 
good or bad, sinful or virtuous, desirable or undesirable. In this 
way his hesitancy to reveal facts about himself may be overcome 
and the basis laid for guidance in accordance with valid evidence. 

The teacher’s “personality” will determine the specific expression 
in any given situation of each of these general policies and pro- 
cedures for establishing rapport. A sincere and honest demeanor 
artistically blended with the proper proportions of “the light touch” 
and seriousness, of levity and gravity, is necessary not only in evalua- 
tion but in all relationships with pupils. Whatever words and 
phrases are used in giving encouragement and praise, in establishing 
confidence and trust, in stimulating a full effort, must always be 
natural and unstilted, friendly yet dignified. The teacher should be 
sensitive to the pupils’ reactions to his approach and keep his 
demeanor flexible so that he can modify it in accordance with these 
reactions. In establishing and maintaining rapport there is no
substitute for the teacher’s own personal adaptability and “social
intelligence.”

When announcing that tests or examinations are to be given, the 
teacher should attempt to prevent them from assuming undue im- 
portance and becoming a crisis situation. Both teachers and pupils 
should take examinations "in their stride" rather than permit them 
to become a major focus of the educative process. Continuous,
regular study habits should be emphasized as preferable to “cram- 
ming" in preparing for examinations. In so far as possible the 
teacher should explain that the purpose of an evaluation is to enable 
both himself and the pupils to find out how well they are getting 
along so that he can give them better help; the purpose is not the 
handing out of rewards and punishments. 

When a test is administered to fairly large groups, the examiner 
should make some preliminary remarks designed to place it in its 
proper perspective. Its purpose, the uses to which the results will be 
put, the value to the pupils of the information provided by the test, 
should all be explained. Standardized group tests are usually ac- 
companied by preliminary instructions, to be read by the examiner, 
concerning the importance or unimportance of speed, the way in 
which the pupil should handle items which are too difficult for 
him, the desirability of guessing or not guessing on responses about 
which he is not certain, and similar points designed to establish 
the mental set most conducive to valid results. When such in- 
structions are not provided, as for teacher-made tests, they should 
be prepared in advance. Needless to say, the teacher's demeanor 
before large groups should be calm, encouraging, and helpful. All 
directions should be read aloud, clearly and distinctly. 


Scheduling Various Types of Evaluation


Achievement of Instructional Objectives—The frequency and 
regularity with which the various aspects of pupils should be 
evaluated depend on both the nature of the aspects and the facilities 
available to the teacher. Achievement of instructional objectives is 
usually evaluated several times during the course of study and almost 
always at the end of a semester in the form of a final examination. 
In determining the frequency of achievement tests a distinction
should be made between two functions of evaluations: (1) as the
basis for guidance and (2) as a means of instruction. There have 
been many investigations of the effect of different frequencies on the 
achievement of pupils. Some of these investigations have indicated 
that more frequent tests result in increased learning, but others 
failed to show this. 

While in this volume guidance has been considered the main 
function of evaluation, the instructional value of varying fre- 
quencies and types of evaluation should also be considered. Among 
the frequencies which have been compared with respect to their 
effect upon achievement are daily tests, three tests per week, tests 
at intervals of six weeks, and tests at only the middle and end 
of semesters. The general conclusion to be drawn from the conflicting
results of these studies is that the optimum frequency for testing
lies somewhere between two or three tests per week and one test
every six weeks.

Similarly, the beneficial effects of more frequent testing are
greater for pupils of lower than for pupils of higher mental ability. 
Usually, however, it has been found that the superiority of the more 
frequently tested pupils as evidenced at the end of a semester dis- 
appears when the two groups of pupils, experimental and control, 
are retested at a later date, say from six to twelve weeks after the 
final examination. In the light of these studies, the best practice for 
classroom teachers is probably determined within the above limits 
of frequency by the organization of the subject matter in a given 
course of study. The teacher may also be guided by obtaining an 
expression of the pupils’ attitudes toward frequency of testing; he 
may well be surprised to learn that pupils prefer more frequent 
tests, if these are not “counted” on grades, as better incentives to 
regular study habits and as aids in avoiding the necessity of 
“cramming.” The general ability level of the class should also be a 
factor; classes composed of pupils with lower ability should be tested 
more frequently. 

Exemption from final examinations has sometimes been used as
an incentive to greater achievement. Here again the results of 
experimental studies are not in agreement. Some studies report 
an increase in achievement on the part of exempted pupils, but 
others find that achievement is superior in groups who know that
they are to take a final examination. Thus, on the basis of
experiments in two college courses, mathematics and applied mechanics,
Remmers (13 : 52) concluded: “While the present results cannot be
generalized for all courses, the present writer would hazard the
[...] experimental variable (exemption from final examination) makes
relatively little difference in the amount, quality, or permanence of
learning, at least as measured by current types of tests and
examinations.”

In view of the lack of conclusive evidence on the motivational
effect of exemption from final examinations, it seems best to reject
the practice as more undesirable than desirable. In the first place,
in most courses the final examination is usually the most carefully
prepared evaluation device used by the teacher. To permit some
students to go unevaluated is to deprive their total cumulative
record of one of its most valuable items. This is especially true where
standardized achievement tests are used with final examinations. In
the second place, it is to be questioned whether desirable attitudes
toward examinations are furthered by setting up exemptions from
them as rewards for high achievement or other good behavior.

In the third place, marking systems are frequently seriously
disrupted when teachers must keep in mind, while giving marks, the
critical point above which exemptions are granted. Serious
discrepancies from the usual frequencies of certain kinds of marks
have been found (3, 10) to result from teachers’ attempts to mark
in accordance with some system of exemptions from final
examinations. In this case at least it is probably best to use these
examinations for their known worth as evaluations rather than for
their unproved value as motivational devices.

Pre-tests may be used to discover the achievement of instructional
objectives which pupils already possess as a result of previous school
and out-of-school experiences. The results of these tests can be used
to plan the relative emphases in instruction and, perhaps, to show
when certain parts of the course may be omitted because the pupils
already know this subject matter. When well constructed, pre-tests
also may serve as stimulators of interest in the materials to be studied.
Such a test thus may indicate to pupils the kind of achievement
they may expect to acquire from the instruction they will receive;
if the test has been interestingly written, they may learn for
themselves the areas in which they are strong and weak and distribute
their learning efforts accordingly. Needless to say, evaluations
derived from these tests should not be used for marking purposes or
in otherwise distributing rewards and punishments if the pupils’
favorable attitude toward pre-tests is to be maintained. Their
diagnostic functions must be their major purpose.

Physical Aspects.—Evaluations of the physical aspects of pupils 
also vary in frequency and extensiveness. Every day, of course, the 
teacher should observe his pupils and seek to discover any who are 
ill and in need of medical care. In the lower grades this may be 
a formal, organized classroom inspection of cleanliness, clothing, 
and general health. In the upper grades, a more casual and informal
approach, although lacking the motivational force of formal in- 
spection, is probably better and is adequate. 

In no case should pupils be permitted to complete their school 
careers without being thoroughly examined by a physician. How- 
ever cursory this examination may be under the most limited circum- 
stances, it is probably better than none at all. The frequency of 
periodic health examinations by physicians prescribed by the Ameri- 
can Medical Association (2 : 7) is as follows: 

From two to five years of age, semi-annually 

From five to fifteen years of age, every two to three years 

From fifteen to thirty-five years of age, every two years 
These frequencies are minimal, of course; they should be increased
whenever a teacher or nurse notices abnormal changes in the ap- 
pearance or behavior of a pupil. 
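The quoted schedule can be restated as a small lookup, sketched here in Python. The function names are hypothetical, and the treatment of the boundary ages (five and fifteen appear in two bands each) is an assumption, since the quoted list does not say which band they belong to. For the band given as "every two to three years," the sketch takes the longer limit.

```python
def max_exam_interval_years(age):
    """Longest gap, in years, between physician examinations for a person
    of the given age, following the frequencies quoted in the text.

    Assumption: each age band includes its lower bound and excludes its
    upper bound, except that age 35 is kept in the last band.
    """
    if 2 <= age < 5:
        return 0.5   # semi-annually
    if 5 <= age < 15:
        return 3.0   # "every two to three years": longer limit assumed
    if 15 <= age <= 35:
        return 2.0   # every two years
    raise ValueError("age outside the ranges covered by the quoted schedule")


def exam_overdue(age, years_since_last_exam):
    """True when the last examination is older than the schedule allows."""
    return years_since_last_exam > max_exam_interval_years(age)


print(exam_overdue(10, 3.5))
print(exam_overdue(16, 1.0))
```

Because the text calls these frequencies minimal, such a check can only flag pupils who are overdue; it cannot show that a pupil needs no earlier examination.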

According to one manual (x : 49-50), “The average child could 
well be examined twice during the high-school years—at entrance 
and before graduation. The last examination may be combined 
with that required for a work certificate. 

“The present trend is to de-emphasize the annual periodic school 
health examination. In many cases, the time consumed by the 
annual repetition of the examination used all possible time available 
of medical, dental, nursing, and teaching staff. Annually, defects of 
vision, hearing, teeth, tonsils, nutrition, and personality were re- 
discovered, recorded, and filed away. Lack of personnel has led to 
hasty, perfunctory, and inexpert examinations so that no effort
could be made to sort out even those cases needing immediate 
medical care." 

General Mental Ability—The general mental ability of pupils 
should be evaluated less frequently than every semester but more 
often than once in the school career. In general, two considerations 
determine the grades in which intelligence tests should be given. 
First, the imperfect although high degree of constancy of intelligence 
level as indicated by general mental ability tests makes it desirable 
that the scores or ratings of pupils on these tests be “refreshed” at 
intervals during their school career in order to reveal any fluctuations 
which may result either from real changes in their rate of intellectual 
growth or from errors of measurement due to imperfections in the 
tests. Second, the need for mental ability measurements is greater 
at some stages in a pupil’s school career than at others. 

It is probably sufficient to give a group test of mental ability to 
every pupil in the school every third or fourth year. Probably the 
best times for this testing are those when the pupils begin a new 
kind of experience or must make a choice among various curricula.
Thus, entrance in the first grade is a desirable time because of the 
lack of preliminary data on a pupil that will help a teacher become 
acquainted with him in the shortest possible time. The beginning of 
elementary schooling at the fourth- or fifth-grade level should also 
be accompanied by a measurement of intelligence since it is known 
that the predictive value of mental ability scores decreases appreciably 
over a period of two or three years. 

On entering junior high school in the seventh grade the pupil 
may have to choose among various subjects and another testing at 
this point should provide fresh data with which to guide his choice. 
Entrance to high school in the ninth grade, where a choice is usually 
made among such curricula as academic, commercial, and technical, 
is another point at which a mental test is desirable. Similarly, the 
last year of secondary education, before the pupil goes to work or
enters college, provides an advantageous time for mental ability tests 
or their equivalent in a battery of achievement tests as a basis for 
vocational guidance. 

In a given school year the test should be given early enough to 
permit the score to be used for guidance purposes during the rest of 
the year. If decisions concerning curricula have to be made at the
beginning of the year before tests can be given, scored, and made
available to counselors and students, the tests should be given at 
the end of the preceding semester. 
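The testing points named above can be checked against the "every third or fourth year" rule with a short Python sketch. The names are hypothetical, and two assumptions are made: grade 4 stands in for "the fourth- or fifth-grade level," and grade 12 is taken as the last year of secondary education.

```python
# Grades the text names as good testing points (see assumptions above).
TESTING_GRADES = (1, 4, 7, 9, 12)
MAX_GAP_YEARS = 4  # "every third or fourth year"

# Years between successive testing points in the schedule.
gaps = [later - earlier
        for earlier, later in zip(TESTING_GRADES, TESTING_GRADES[1:])]


def mental_test_due(current_grade, last_tested_grade=None):
    """Rough rule for whether a group mental ability test is due."""
    if last_tested_grade is None:
        return True                      # no score on record
    if last_tested_grade >= current_grade:
        return False                     # already tested this year
    if current_grade in TESTING_GRADES:
        return True                      # a scheduled testing point
    return current_grade - last_tested_grade >= MAX_GAP_YEARS


print(gaps)
print(mental_test_due(7, last_tested_grade=4))
print(mental_test_due(5, last_tested_grade=4))
```

Under these assumptions no gap in the schedule exceeds four years, which agrees with the stated rule. The last branch of the function catches a pupil who missed a scheduled testing point.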

Special Abilities.—Tests of special abilities should be given
whenever information is necessary in counseling the pupil, as when 
choices of curricula and vocations are to be made, or failures or 
successes in school work require investigation and explanation. These 
tests, with the exception of art and music aptitude tests, will prob- 
ably be given mainly in the latter years of high school. A test of 
mechanical aptitude at the beginning of high school, however, 
may throw valuable light on the desirability of a technical curriculum 
for a given pupil. 

Emotional and Social Adjustment—The emotional and social
adjustment of pupils should, of course, be evaluated continuously, 
on every possible occasion, every day of their school career. The 
continuity of adjustment evaluation is implied in the anecdotal 
record technique. The other techniques, self-inventories and rating 
methods, cannot be used so continuously. In the past they were not 
used with all pupils, only with those who became problem cases or 
called themselves to the attention of teachers and school administra- 
tors in an undesirable way. If an anecdotal record system is being 
applied, self-inventories and rating methods may be used only with 
the pupils who the anecdotal records reveal are in need of further 
study and evaluation. Otherwise it is probably desirable to require 
all pupils to fill out an adjustment inventory and be rated by their 
teachers and fellow pupils at intervals. If the inventories used are 
easily administered and scored, this procedure should not put too 
heavy a burden upon the facilities of most schools.

Attitudes and Interests.—In so far as they are assigned a role in 
educational outcomes the attitudes and interests of pupils may be 
evaluated with the same frequency as the achievement of other 
instructional objectives. Thus they can be measured before and 
after certain educational experiences, at the beginning and end of a 
school year. When considered mainly as factors affecting pupils’ 
choices among curricula and vocations, attitudes and interests are 
usually evaluated at the beginning of high school by means of 
educational interest blanks, and toward the end of high school or 
in college by means of vocational interest blanks. It is desirable and
practical to encourage every pupil to fill out such blanks rather than
only those pupils who take the initiative in expressing a need for
educational or vocational guidance.

Environment and Background.—The environment and back- 
ground of pupils, like their emotional and social adjustment, re- 
quire continuous evaluation. The results of this evaluation should 
be passed on from teacher to teacher in the form of a cumulative 
record for each pupil. Teachers should make such additions and 
changes in the total accumulated information concerning the 
pupil's environment as are found desirable in the course of his 
growth and development. 

The organization and planning of the total evaluation program of 
a given school system, school, or classroom can best be carried out 
by the supervisor, principal, or teacher responsible for the particular 
administrative unit and best acquainted with its facilities and limita- 
tions. This planning and organization, by whatever course it 
proceeds, should result in (1) a schedule of testing and evaluation 
for each semester, including dates, places, and devices for each test; 
and (2) a program for training and instructing teachers regarding 
each of the tests to be used and the procedure for administering it. 


Open-Book Examinations


In order to promote examinations which require “thinking” 
rather than mere recall of textbook information, examinations have 
been suggested which permit the use of textbooks and notes. Among 
the advantages claimed (17) for these open-book examinations are
their value for testing the worth of a course, encouraging sound 
preparation on the part of the pupil, presenting a more natural 
situation, and necessitating comprehensive thought questions. 

These examinations need not be in the form of essay questions; 
short-answer questions of the types described in Chapter IX can 
readily be put in such a form as to require ability “to see situations 
in the whole,” to “use facts in solving problems,” to “draw inferences 
from known to unknown situations,” and in general to achieve 
broader, more permanent objectives of instruction than mere 
memorization of the facts and procedures stated in textbooks and 
lecture notes. 

Questions for open-book examinations must be constructed so
that the pupil cannot answer them by simply turning to the proper 
page in a textbook. Rather they must be arranged so that he is re- 
quired to use his understanding of the large sections or basic 
principles in the book, to draw together under a single heading the 
scattered definite cases of a fundamental generalization, or to cite 
specific instances of a major trend. Mathematics tests illustrate the 
possibilities of these examinations most easily; for example, original 
problems can be composed and presented which require so basic an 
understanding of the material in the course that the availability of 
the textbook for reference does not eliminate the necessity for real 
thinking and subject-matter mastery on the part of the pupil. 

It should be possible to construct similar examinations in all other 
subjects in so far as the teaching of these subjects stresses something 
more than the memorization of facts and procedures from textbooks, 
lectures, or other sources. In constructing open-book examinations 
teachers will be forced to stress mental processes other than memory. 
A similar shift in emphasis will result in pupils’ study habits and 
methods of preparation for examinations. Such evidence as is 
available indicates that open-book examinations appeal to pupils as 
more “natural” and representative of the type of use they will make 
of their achievement in out-of-school life. The Stalnakers (18) have 
reported their experience with these examinations in the humanities 
at the University of Chicago. By splitting the final examination into 
a three-hour morning open-book session and a three-hour afternoon 
closed-book session they were able to compare the relative standing 
of the students on the two types of examinations. 

In general, they found that the relative standing did not change 
and that both types of examinations gave results similarly related 
to the scores on an intelligence test. This means that the open-book
examination gave neither more nor less advantage to the more intelligent student
than did the closed-book examination. Whatever merit open-book 
examinations may possess will be shown in the types of questions 
they necessitate and by the kind of study habits and test prepara- 
tion they stimulate in students. Teachers should experiment with 
this method of administering achievement tests to determine what 
benefits it may yield in raising the level of the thought processes
required by examinations and of pupils’ methods of study and
preparation for them.


ORAL ADMINISTRATION OF GROUP TESTS


In small school systems the program of achievement testing is fre- 
quently hampered by the lack of facilities for duplicating or mimeo- 
graphing evaluation devices. Even when these facilities are available, 
it frequently happens that the load of paper, clerical, and machine 
work that would be imposed by teacher-made tests if given as often 
as desirable is too great for the available facilities to handle. 
Teachers are then forced to solve their difficulty either by reducing 
the number of tests below what they consider desirable or by using 
essay tests in situations where short-answer objective tests are 
preferable. Oral administration, in which the teacher reads short- 
answer questions of the true-false, multiple-choice, or completion 
types and the pupils write on a sheet of paper the brief symbols 
necessary to indicate their responses, can frequently furnish a way 
out of this difficulty. 

Orally administered short-answer tests have been found to yield 
results not appreciably different from those obtained with printed 
tests in which the pupil reads the questions rather than listens to 
them. Thus Schultz and MacVaugh (15) found a correlation of
.70 between the Chapman Oral Intelligence Test and the National
Intelligence Test for 88 seventh- and eighth-grade pupils; the
correlation between odd and even items of the test given orally was
.86 for this group. These figures indicate that the orally administered
intelligence test probably yields scores equal in validity and
reliability to those of most printed tests of similar length. The test
apparently did not suffer appreciably from being administered 
orally. Sims and Knox (16) obtained similarly favorable results 
with an orally administered five-choice vocabulary test. The multiple- 
response tests presented orally were only slightly more difficult 
than the same tests presented visually, were not seriously reduced 
in reliability, and tended to measure what was measured by the 
same tests presented visually. 
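
The odd-even correlation mentioned above can be worked out from item-by-item marks. The following is a minimal sketch in Python, not the procedure of any study cited here; the pupils' marks are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists of scores."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def odd_even_scores(item_marks):
    """Split one pupil's marks (1 = right, 0 = wrong) into odd- and even-item scores."""
    odd = sum(item_marks[0::2])   # items 1, 3, 5, ...
    even = sum(item_marks[1::2])  # items 2, 4, 6, ...
    return odd, even

# Invented marks for five pupils on an eight-item test (illustration only).
pupils = [
    [1, 1, 1, 1, 1, 1, 1, 0],
    [1, 1, 1, 0, 1, 1, 0, 0],
    [1, 0, 1, 1, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 0, 0, 0],
]
halves = [odd_even_scores(p) for p in pupils]
odd_half = [h[0] for h in halves]
even_half = [h[1] for h in halves]
r_odd_even = pearson_r(odd_half, even_half)
print(round(r_odd_even, 2))
```

With so few pupils and items the coefficient is unstable; the sketch shows only the arithmetic of the odd-even comparison.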

Several studies (4, 9, 11, 19) have shown that true-false tests 
given orally yield results which correlate as highly with the same 
test presented visually as the tests correlate with themselves. Ross 
and Gard (14) found that two group intelligence tests involving
several different types of short-answer items could be dictated to
pupils without serious loss in reliability or validity. It may be
concluded from these studies that the teacher who lacks mimeographing
or duplicating facilities or who is forced to economize on
paper and clerical work can use orally administered tests with the
expectation that the scores will reflect achievement almost, if not
quite, as well as scores on tests presented in the usual printed form.

In addition to economy and independence of duplicating facilities, 
several other advantages have been claimed for orally administered 
tests: 

1. Orally administered tests insure that every pupil will try
every item and hence will be measured by every item. Pupils will not
waste time on or be stopped by the more difficult test items, thus
failing to attempt those at the end of the test.

2. Pupils are constantly stimulated by the oral presentation so
that each item constitutes a new challenge more distinctly set off
from the preceding items in the test.

3. Oral comprehension ability is tested independently of the
pupil’s reading ability. Oral comprehension is only slightly if at
all less important than reading comprehension to the pupil’s success
as a worker and citizen. Tests which put a premium on this ability
may serve as valuable complements to the usual stress placed upon 
reading ability in school. 

The major disadvantage of the oral administration of tests is the 
“personal equation” introduced by the teacher or other person who 
reads the questions. Clarity of speech, speed of reading, and volume
all vary with different teachers. These variables can be greatly re- 
duced by strict adherence to a set of specifications for reading the 
test, but they must be fixed in advance. Such variables would be 
most undesirable for tests administered by many different teachers; 
the results might then be less comparable than those obtained with 
a printed test. Orally presented tests should, however, be used most 
for the teacher's own intra-semester evaluations. The more extensive 
and refined tests used for final examinations and the standardized 
tests with comparable norms used for guidance purposes are usually 
printed. Needless to say, whatever disadvantages may result from 
the test administrator’s faulty reading and speaking will be experi- 
enced to the same degree by all the pupils taking the test. 

A second disadvantage is that orally administered tests are somewhat
less economical of testing time than printed or mimeographed
tests because fewer items can be presented orally per unit of time. 
And it is essential, of course, that pupils with hearing defects be dis- 
covered and given front seats. If sufficiently detailed and careful 
directions are prepared for orally administered tests, and if teachers 
are instructed and trained in giving them, they may be considered 
as comparable from teacher to teacher or from tester to tester as the 
highly respected individual intelligence tests which are administered 
orally in large part.

Procedure for Oral Administration.—Preparations for administering
oral tests should be carefully made. Pupils should be given a
sheet of ruled paper on which to record their answers. First, they
should be instructed to write their names at the top of the sheet,
together with any other personal data considered desirable. Second,
they should be instructed to fold the answer sheet vertically along as
many equally spaced lines as there are columns of answers. Third,
they should be instructed to number along the vertical folds for as
many items as the test contains. Care should be taken to allow
sufficient space between the columns of numbers for writing the
responses. Fourth, the nature of each type of item should be
explained as it arises and the method of making the response should
be specified at the beginning of each type of test item; one or two
illustrative items should be read and the answers given, together
with the method of recording them. This will help make clear to
pupils the teacher's role and their own procedure. If some of the test
items require computation, as in arithmetic tests, pupils should be
furnished with an extra sheet of paper upon which to figure. Finally,
each item should be read as clearly and distinctly as possible with
an interval of several seconds after each one. The number should
be given before the item is read.

The pause between each item and the next one should be constant.
A regular rhythm with pauses of constant length after each item
should be established to which the pupils may quickly become
accustomed.

Teachers should not rely upon their own sense of timing to insure
proper and equal pauses; they should use a watch with a second
hand or still better a stop watch. Pauses of different lengths may be
found desirable for different types of test items. Typical of the times
required by the various types of items are the following: Chapman
(5) allowed a ten-second interval after each item of his simple-recall 
information questions; each of these questions was read once and 
once only. Each of his simple arithmetic items was read twice and 
five-second intervals were allowed after each one. For his opposites 
vocabulary test Chapman read each word only once and allowed five
seconds to elapse before reading the next word. For each item on 
sentence understanding to be answered Yes or No he allowed five 
seconds and read each sentence only once. Lehman (11) administered 
85 true-false statements in twenty-five minutes, reading each state- 
ment twice with an interval between each reading equal to the 
time in which he could count silently from one to ten as rapidly 
as possible.
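
The pacing figures above lend themselves to a rough estimate of how long an orally administered test will take. The sketch below is an illustration in Python under stated assumptions: the function name is hypothetical, and the six-second reading time per question is an assumed value, not one reported by Chapman or Lehman.

```python
def oral_test_minutes(n_items, read_seconds, pause_seconds, readings=1):
    """Estimate total minutes when each item is read `readings` times and then followed by one pause."""
    per_item = readings * read_seconds + pause_seconds
    return n_items * per_item / 60

# Lehman's reported pace: 85 true-false statements in twenty-five minutes.
seconds_per_statement = 25 * 60 / 85

# Hypothetical plan: 40 recall questions, each read once in about 6 seconds
# (an assumed reading time), followed by a ten-second interval as Chapman allowed.
planned_minutes = oral_test_minutes(n_items=40, read_seconds=6, pause_seconds=10)

print(round(seconds_per_statement, 1), round(planned_minutes, 1))
```

An estimate of this kind may help a teacher judge in advance whether a planned test will fit within the class period.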


STEPS IN ADMINISTERING TESTS TO GROUPS OF PUPILS


After a specific evaluation device has been constructed or selected, 
rapport between teacher or examiner and pupil has been established, 
a date and place for administering the test has been selected in 
accordance with the schedule of evaluation, and some decision has 
been made concerning whether the test is to be an open-book or a 
closed-book test or a printed or orally administered test, it becomes 
necessary to plan and execute the detailed steps for actually ad- 
ministering the test. We shall now list the steps. 

1. Responsibility. If the examination is to be administered to more 
than one classroom, such as a test of mental ability to be given 
in one day to all the pupils in a school, the responsibility for 
planning and executing the details of the testing program should 
be placed upon a single member of the school personnel who 
will make thorough preparations for every detail and supervise 
each of the following steps. If only one classroom is to be given 
the test, the teacher should, of course, take this responsibility. 

2. Securing tests. If the tests to be used are standardized, externally 
made, and commercially available, they should be ordered well 
in advance of the date on which they are to be given. Most
publishing companies require about ten days between receipt 
of an order and delivery of the tests. If the test is locally con- 
structed or teacher-made and is to be administered in printed 
form, the manuscript should be in the hands of the mimeographer
or printer at least a week before the test is to be used.
After the mimeograph stencils have been cut or the test is in 
page proof, it should be proofread meticulously, one person 
reading from the manuscript and another keeping his eyes 
on the stencil or proof. All capitalizations and punctuations, 
spacings and indentations should be closely inspected for correct- 
ness and for the probable legibility of the printed copy. Not too 
many extra copies of the test should be printed if it is planned 
to revise it in the future. 

3. Schedule. If the whole school is to be tested at once or if
several tests are to be given over a period of time, a schedule
of hours and rooms in which specific tests will be given should
be mimeographed and distributed to every teacher concerned in
the testing program. Teachers should be urged to avoid emphasizing
the examination so excessively that it begins to loom
as a crisis in the pupils’ life.

4. Room for testing. The room in which tests are given should
preferably be a familiar one, such as a classroom, and should 
be comfortable for sitting, reading, and hearing as well as 
writing. 

5. Distractions. Distractions during tests should, of course, be
avoided as much as possible. A sign should be put on the door 
of the room warning visitors to keep out because testing is 
going on. Noises from the street or from other classrooms should 
be reduced if possible to the point where they no longer distract. 
Pupils should be informed if they are to disregard school bells 
and continue working past the regular school period. 

6. Distribution of tests to examiners. If more than one classroom
is being tested at a time, the tests should not be distributed be- 
fore the day of the examination. Packages for each classroom 
should be made up with the proper number of tests, answer 
sheets, special pencils, or other materials well in advance of the 
day set for the testing. 

7. Examiner training. Each person, examiner or teacher, who will
administer the test should be provided with a test manual and a
sample copy of the test several days in advance and urged to
study the details of its administration closely, preferably by
giving the test to himself. In this way, by the time the test is
to be given he should have an intimate knowledge of the oral
and written directions for it, the ways in which responses should
be indicated, and the time limit for each part.

8. Adherence to directions. Strict adherence to the test manual in
the matter of answering pupils’ questions after a test has begun
or assisting pupils with test items, unfamiliar words, or
misunderstood directions is absolutely necessary with standardized
tests if the scores are to be interpreted in the light of the norms
provided. If the directions in the manual are violated, it may
be impossible to tell whether the norms are still applicable.

9. Proctors. One proctor for every fifty pupils in excess of fifty
should assist the examiner in handing out and collecting test
materials and making sure that the pupils are following the
correct procedure in recording responses.


10. Examiner's check list. Before giving the test, each teacher or
examiner should have a written check list of his duties covering
in proper order such procedures as (a) seating pupils, (b) announcing
the nature and method of the test, (c) distributing
writing materials, (d) distributing answer sheets, (e) distributing
test booklets, (f) giving instructions for filling in names
and other personal data, (g) reading aloud the general directions
for the whole test, (h) observing the time limit for each subtest,
(i) collecting the test booklets, the answer sheets, and writing
materials, and (j) packaging and forwarding the collected materials.

11. Examiner's attitude. Announcements and directions should be
read aloud slowly and clearly. The examiner’s attitude should
be sufficiently impersonal and efficient to command respectful
attention but not so severe as to arouse anxiety in the pupils.

12. Pupil seating. Pupils should be placed in alternate seats or be
seated in some other manner that will make it difficult for one
pupil to see another’s answers.

13. Pupils’ names, etc. Pupils should be instructed not to write
their names or other personal data until directions have been
given. In the primary grades it is helpful if the teacher fills in
the names and other personal data before the tests are distributed.


14. Timing. If the examination is given with certain time limits,
they should be observed carefully by means of a watch or clock
with a second hand or, if available, a stop watch. All timepieces
should be checked previously for accuracy. If possible, one of
the proctors should check the time throughout the test with his
own timepiece. In timing tests the examiner should write
down the exact hour, minute, and second at which the signal
to start is given. The time at which the signal to stop is to be
given should be computed and also written down.

15. Supervision. While the pupils are working on the test, examiners
and proctors should move about the room unobtrusively
to make sure that everyone is recording his answers in the
proper way and is working on the right part of the test. Watching
a pupil over his shoulder or moving quickly about the room
should be avoided because it may distract pupils from their
work.

16. Ending the test. The examiner should make sure that all the
pupils stop work immediately when the time is up. The answer 
sheets, or test booklets if answers are recorded in them, should 
be collected, and then writing materials. 

17. Notations. After the tests have been collected, any necessary 
notations of abnormalities in testing for any given pupil should 
be made, such as the need to leave the room, marked conditions 
of anxiety, or such disruptions of the entire group as a fire drill. 
They can then be taken into account in interpreting the scores. 
In the case of standardized tests the examiner should be es- 
pecially careful to note any discrepancies between the conditions, 
directions, or time limits under which the test was administered 
and those stipulated in the test manual. 

18. Absentees. Absentees should be followed up and required to
take the examination so that every pupil will be evaluated and
the teacher or counselor will not lack evaluative data on anyone.
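
The computation called for under Timing in the list above, writing down the starting time and working out the stopping time, can be sketched as follows. This is an illustration in Python only; the starting time and the subtest limits are invented, the function name is hypothetical, and the sketch assumes that subtests follow one another with no break between them.

```python
from datetime import datetime, timedelta

def subtest_schedule(start, limits_minutes):
    """Return (start, stop) pairs for consecutive subtests, assuming no break between them."""
    schedule = []
    begin = start
    for limit in limits_minutes:
        end = begin + timedelta(minutes=limit)
        schedule.append((begin, end))
        begin = end
    return schedule

# Hypothetical signal to start, noted to the second, and invented time limits in minutes.
start_signal = datetime(1943, 5, 3, 9, 15, 30)
schedule = subtest_schedule(start_signal, [10, 12.5, 8])

for number, (begin, end) in enumerate(schedule, start=1):
    print(number, begin.strftime("%H:%M:%S"), end.strftime("%H:%M:%S"))
```

Where the test manual prescribes a pause between subtests, the pause would have to be added to each starting time.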


SCORING EVALUATION DEVICES


After the test booklets or answer sheets have been collected at the 
end of the test, they must be scored to determine which responses 
are correct, which incorrect, and which omitted, and the scores must
be added in the proper way to obtain a raw score. Despite the
objectivity of short-answer test scoring, certain procedures and
precautions are indispensable if this step in evaluation is to be carried
out with a maximum of accuracy and efficiency. The necessity for
extreme care in planning and executing the scoring of evaluation 
devices has been indicated by several studies (7, 8, 12) which have 
found that scoring errors occur with appalling frequency. Both 
"constant" errors due to failure to understand scoring directions and 
resulting in scores which were consistently too high or too low, and 
"variable" errors due to carelessness in marking, adding, computing, 
or transcribing scores were found with sufficient frequency to 
warrant (1) the careful training and instruction of scorers and 
(2) the practice of rescoring at least a sample of any group of test 
booklets. 

Order of Scoring.—If one person is to score all the tests, a given 
page (or in the case of essay tests each question) in all the booklets 
should be scored first, then the next page, and so on, rather than 
scoring all of one booklet before going on to the next. If so many
booklets must be scored that several scorers are necessary, each
person should specialize on a given page or group of pages of the
booklet but should score only one page in all the booklets at a time. 

Rescoring.—If a large number of booklets are to be scored and 
sufficient clerical help is available, it is always worth while to rescore 
them so as to eliminate the errors which will almost inevitably occur 
in a clerical task like this. If a complete rescoring is not feasible, 
every fifth or tenth booklet should be rescored to obtain a rough idea 
of the frequency and magnitude of scoring errors. The rescoring of 
a sample will sometimes uncover such inaccuracy as to make it de- 
sirable for the remainder of the test booklets to be rescored and 
thus checked. 
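
The sampling check described above can be sketched in a few lines. The following Python illustration rests on assumptions: the booklet numbers and scores are invented, and the function names are hypothetical rather than part of any established scoring procedure.

```python
def every_nth(booklets, n=5):
    """Select every nth booklet (the 5th, 10th, ...) for rescoring."""
    return booklets[n - 1::n]

def summarize_rescoring(first_scores, rescored):
    """Compare original scores with rescored values; both map booklet number to score."""
    diffs = {b: rescored[b] - first_scores[b] for b in rescored}
    wrong = {b: d for b, d in diffs.items() if d != 0}
    return {
        "checked": len(rescored),
        "in_error": len(wrong),
        "error_rate": len(wrong) / len(rescored),
        "largest_error": max((abs(d) for d in wrong.values()), default=0),
    }

# Invented data: twenty booklets with first-scoring results.
booklets = list(range(1, 21))
first_scores = {b: 50 + b for b in booklets}
sample = every_nth(booklets, n=5)

# Invented rescoring results for the sample; booklet 10 was first scored two points too high.
rescored = {5: 55, 10: 58, 15: 65, 20: 70}
summary = summarize_rescoring(first_scores, rescored)
print(sample, summary)
```

How large an error rate in the sample should prompt a complete rescoring is a matter of judgment that the sketch does not settle.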

Scoring Devices.—Before scoring can be begun it is necessary
to have scoring keys or stencils. The best contemporary standardized 
commercially available tests are provided with scoring keys, stencils, 
or other devices which enable rapid and accurate scoring. These 
keys are of the following four major types: 

1. Strip keys are used with tests in which the answer spaces are 
aligned along one side of the page in the test booklet. They contain 
the correct responses in a vertical column on a narrow strip of paper
and are used in scoring by placing them adjacent to the column of
responses in the test booklet and noting the agreement or disagreement
of the pupil's responses with those on the strip key.

Fig. 13.—The International Test Scoring Machine. (Courtesy of IBM.)

2. Window stencils are used when the answer spaces are 
scattered over the page of the booklet rather than being in a single 
column. These stencils are usually made of heavy paper or celluloid 
in which circular or oblong holes are punched in such a way that 
when the stencil is placed over the page of the test booklet the 
answer spaces are visible through the holes and the pupil’s responses 
can be compared with the correct ones printed above the hole in the 
stencil. The number of pupil responses which agree with the correct 
answers on the stencil can be counted, thus yielding the pupil's 
score for that page of the test. Frequently, the stencil has heavy 
black lines connecting the holes so as to guide the scorer's eye move- 
ments, thus facilitating his speed and accuracy in inspecting the 
pupil’s responses.

3. A third frequently used scoring device for present-day tests 
is the International Test Scoring Machine shown in Fig. 13. In order 
for tests to be scored by this machine the pupils’ responses must be 
recorded with a special soft pencil whose marks on paper will 
conduct electricity, and the answer sheet must be specially cut and 
printed. The answer sheet is scored by inserting it in the test scoring 
machine, pressing a lever so as to bring the sheet against contact 
units inside the machine, and reading the score registered by an 
indicating needle on a visible meter. The machine is set to dis- 
tinguish between right and wrong answers by means of a window 
stencil prepared and inserted in it beforehand. Many different 
varieties of scoring formulas, part scores, and weighted scores can be 
obtained by setting various switches on the machine. International 
Test Scoring Machines are now available at many testing centers 
and university bureaus. For large-scale testing programs such 
machines often provide the most accurate and least troublesome 
means of scoring objective tests. 

4. Another set of scoring devices are test or answer sheets whose 
reverse side has been prepared in such a way that correct answers 
will fall within circles or squares printed on that side. Thus, the 
Clapp-Young Self-Marking Tests (6) have answer sheets attached 
to each page of the tests. Answers to the multiple-choice items are
indicated by placing an X in the appropriate one of a row of small
squares. The X is duplicated by means of a thin coat of carbon on
the reverse side of the answer sheet onto another sheet which con- 
tains only the squares for the correct answers. If the pupil writes 
his X in the correct square, it will be transferred by the carbon 
into the square on the second sheet. The teacher can then score the 
tests simply by counting the number of duplicated X’s which fall 
in the small squares on the second sheet. Similarly, as was indicated 
in Chapter XIV in the discussion of the Ohio State University 
Psychological Test, Toops has devised a method of answering and 
scoring by which the pupil punches holes with a stylus in ap- 
propriate squares on an answer pad; his answers are scored by 
counting the number of holes or “pyramids” which appear within 
squares on the reverse side of the answer sheet. This method is also 
used in one form of the Kuder Preference Record. 

Marking Responses.—If a strip key or window stencil has been 
selected and prepared, as will be the case with most teacher-made 
tests, rules should be set up concerning which kind of responses, 
correct or incorrect, should be marked and in what way. If only 
correct responses are to be considered in the scoring formula, it is 
usually most economical of effort to mark only wrong answers and 
omissions so that the score can be obtained by subtracting the 
number of marks from the total number of items. If a scoring 
formula involving both right and wrong answers and omissions is 
being used, all three kinds of responses should be marked, each in 
a different way, such as a dash (—) for correct responses, an x for 
incorrect responses, and a zero for omissions. A colored pencil should 
usually be used in making all marks. Neatness and uniformity of 
size and spacing in marking responses will increase the accuracy 
of scoring. 

Summing Marks.—The next step is to count the number of marks
of each kind and record on each page or at the end of each subtest
the totals of rights, wrongs, and omissions. This is a major source
of error if not done carefully.

Applying the Scoring Formula.—The totals of correct responses,
incorrect responses, and omissions must then be substituted in the
scoring formula being used. See Chapter IX for frequently used
formulas.
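
The marking, summing, and substitution steps just described can be illustrated with a short Python sketch. The key and the pupil's responses are invented. The formula of rights minus wrongs divided by one less than the number of choices is a widely used correction for guessing and is assumed here for illustration; it is not quoted from Chapter IX.

```python
def tally(responses, key):
    """Count rights, wrongs, and omissions; None marks an omitted item."""
    rights = wrongs = omits = 0
    for given, correct in zip(responses, key):
        if given is None:
            omits += 1
        elif given == correct:
            rights += 1
        else:
            wrongs += 1
    return rights, wrongs, omits

def rights_only_score(n_items, wrongs, omits):
    """Score obtained by subtracting the marked wrongs and omissions from the number of items."""
    return n_items - (wrongs + omits)

def corrected_score(rights, wrongs, n_choices):
    """Rights minus wrongs / (choices - 1); an assumed correction-for-guessing formula."""
    return rights - wrongs / (n_choices - 1)

# Invented ten-item true-false key and one pupil's responses.
key = list("TFTTFFTFTT")
responses = ["T", "F", "F", "T", None, "F", "T", "T", "T", None]

rights, wrongs, omits = tally(responses, key)
score_rights = rights_only_score(len(key), wrongs, omits)
score_corrected = corrected_score(rights, wrongs, n_choices=2)
print(rights, wrongs, omits, score_rights, score_corrected)
```

Keeping the three tallies separate, as recommended above, allows the total to be checked against the number of items before any formula is applied.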




Transcribing Scores.—Scores for each page or subtest must usually 
be transcribed to the front page of the test. This also is frequently 
a source of error. 

Transmuting Scores.—The transcribed raw scores generally have 
to be transmuted into some form of derived score, such as percentile, 
T-score, or scaled score. This is usually done by referring to a table 
that gives the derived score to which each raw score is equivalent. 
Derived scores will be discussed in Chapter XXII, but it is pertinent 
here to indicate that this last step in scoring also requires consider- 
able care if accuracy is to be achieved. 
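
Transmuting a raw score by reference to a table can be sketched as a simple lookup. In the Python illustration below the conversion table is entirely invented and the function name is hypothetical; a real table would come from the manual of the test in use.

```python
from bisect import bisect_right

# Hypothetical conversion table: (lowest raw score in the band, percentile). Invented values.
CONVERSION = [(0, 1), (20, 10), (35, 25), (50, 50), (65, 75), (80, 90), (95, 99)]

def to_percentile(raw):
    """Return the derived score for the band of the table that contains `raw`."""
    if raw < CONVERSION[0][0]:
        raise ValueError("raw score below the range of the table")
    floors = [floor for floor, _ in CONVERSION]
    return CONVERSION[bisect_right(floors, raw) - 1][1]

derived = {raw: to_percentile(raw) for raw in (0, 34, 50, 64, 104)}
print(derived)
```

Rejecting scores below the range of the table guards against one kind of transcription slip, though it cannot catch a plausible but wrong raw score.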


SUMMARY 


Rapport between pupil and evaluator is essential to the validity of 
the evaluation process and can be achieved only through a combina- 
tion of certain attitudes and personal adaptabilities. The various 
aspects of pupils should be evaluated with different frequencies de- 
pending on the needs for guidance purposes and on the facilities 
available. Open-book examinations may prove valuable in improving 
study and teaching techniques and in raising the level of the mental 
processes elicited by tests. Orally administered tests reduce the 
clerical burden of teacher-made tests; they also have other advantages, 
especially if carefully designed and administered. The administra- 
tion of group tests requires careful planning and execution, particu- 
larly with respect to adherence to standardized directions. The 
scoring of tests, either essay or short-answer, requires rigorous train- 
ing of scorers, attention to detail, rescoring, and suitable devices if 
maximum accuracy and speed are to be attained. 


QUESTIONS 


1. Evaluate a school system well known to you on the basis of its 
facilities for administering timed standardized tests on a wide scale 
and obtaining accurate scores. 

2. Compare the validities for the needs of everyday living of work-limit
tests, which give the pupil all the time he needs to complete the
test, and time-limit tests, which fix the time allowed so that speed
becomes an important factor in test performance.

3. It has been said that most life situations outside of school in which
books are used are open-book situations. Consider this statement in
the light of its implications for open-book examinations in school
and for making school work more realistic.

4. Design an experiment to compare oral and printed short-answer tests
with respect to their cost in time and money and their reliability and
validity for equal expenditures of time and money.

5. How feasible would it be to have a trained school psychologist in
charge of administering the evaluation program in your school
system? Consider the cost and the returns in both monetary and
nonmonetary terms.

6. What are the implications for all teacher-pupil relationships of the
need for rapport in all evaluations of pupils?


REFERENCES

1. American Association of School Administrators, Health in School,
Washington: American Association of School Administrators,
Twentieth Yearbook, 1942.

2. American Medical Association, Periodic Health Examination,
Chicago: The American Medical Association, 1940.

3. Anderson, C. J., “Is the exemption system worthwhile?” School and
Society, Vol. 3 (1916).

4. Briggs, G. H., and Armacost, D. H., “Results of an oral true-false
test,” Journal of Educational Research, 26 : 595-596 (1933).

5. Chapman, J. C., “A group intelligence examination without prepared
blanks (revised form),” Journal of Educational Research,
11 : 269-279 (1925).

6. Clapp, F. L., and Young, R. V., The Clapp-Young Self-Marking
Tests, Boston: Houghton Mifflin Company.

7. Dearborn, W. F., and Smith, C. W., “The result of re-scoring
id ee tests,” Journal of Educational Psychology, 20 : 177-183
(1929).

8. Herbst, R. L., “How accurately do teachers score achievement tests?”
Journal of Educational Research, 22 : 405-408 (1930).

9. Jensen, M. E., “An evaluation of three methods of presenting
true-false examinations: visual, oral, and visual-oral,” School and Society,
32 : 675-677 (1930).

10. Kemmel, H., “Exemption from examinations and grades,” School
and Society, 8 : 112-114 (1918).

11. Lehman, H. C., “The oral vs. the mimeographed true-false,” School
and Society, 30 : 470-472 (1939).


12. Pintner, R., “Accuracy in scoring group intelligence tests,” Journal
of Educational Psychology, 17 : 470-475 (1926).

13. Remmers, H. H., “Exemption from college semester examinations
as a condition of learning,” Studies in Higher Education, XXIII,
Bulletin of Purdue University, November, 1933.

14. Ross, C. C., and Gard, P. D., “Two modified methods of administering
two group intelligence tests,” University of Kentucky, Bulletin
of the Bureau of School Service, Volume 2, 1930.

15. Shultz, R. N., and MacVaugh, G. S., “A short oral group intelligence
test,” School and Society, 40 : 639-640 (1934).

16. Sims, V. M., and Knox, L. V., “The reliability and validity of
multiple response tests when presented orally,” Journal of Educational
Psychology, 23 : 656-662 (1932).

17. Stalnaker, J. M. and Ruth C., “Open-book examinations,” Journal
of Higher Education, 5 : 117-120 (1934).

18. Stalnaker, J. M. and Ruth C., “Open-book examinations: results,”
Journal of Higher Education, 6 : 214-216 (1935).

19. Stump, N. F., “Oral vs. printed method in the presentation of
true-false examinations,” Journal of Educational Research, 18 : 423-424
(1938).




CHAPTER XXI 


Interpreting Scores 


AFTER THE TEACHER HAS CONSTRUCTED OR SELECTED AN EVALUATION 
device and has administered and scored it, the result is a 
raw score for each pupil. It is the purpose of this and the following 
chapter to discuss the procedures necessary for the next step in the 
evaluation process, namely, the interpretation of raw scores. 

Apart from providing the techniques necessary for interpreting 
raw scores, we shall also furnish the basic statistical equipment 
necessary for understanding the literature of experimental educa- 
tion, which consists largely of statistical interpretations of controlled 
observations. Furthermore, there will also be provided techniques 
for evaluating the reliability and validity of the evaluation devices 
themselves; the resulting measures will enable teachers to improve 
these aspects of their own evaluation devices. For those teachers 
who select rather than construct evaluation devices, these two chap- 
ters should provide an understanding of the techniques by which 
the commercially available devices have been constructed and evalu- 
ated and should thus improve the teacher’s own judgment in test 
selection. 

The Meaninglessness of a Single Raw Score.—Suppose a teacher 
has given a test, say a test of achievement of instructional objec- 
tives in a history course, or a test of reading ability, or an attitude 
questionnaire. Let us suppose further that one of the pupils
obtained a raw score of 104 on it.

What is the meaning of the pupil’s score of 104? Taken by itself, 
a single raw score has little meaning. If the teacher knows the number
of items in the test or the total possible score, the raw score can
be used to determine the percentage of correct responses given by 
the pupil. Even this, however, is not meaningful as an indication of
the pupil's status because the difficulty of tests cannot be determined
on the basis of a single pupil’s responses. For example, a twelfth- 
grade pupil may receive the total possible raw score on a test; yet 
unless the teacher has some basis for comparison, such as the 
performance of other twelfth-grade pupils on this test or a standard 
of achievement based on social utility, it is conceivable that this 
child's performance is no better than that of, say, the average 
seventh-grade pupil. The same situation prevails whether a pupil 
receives zero or an intermediate score on a test. Some basis for 
comparison is always necessary for the interpretation of a single 
raw score. 


THE RELATIVE NATURE OF TEST SCORE INTERPRETATION


The comparisons necessary if raw scores are to be given meaning 
may be made either against absolute standards or against norms 
based upon other pupils. Absolute standards are those which enable 
the teacher to consider a pupil qualified regardless of the relation- 
ship between his performance and that of other persons. Thus if it is 
considered desirable and satisfactory for a pupil to be able to write 
his own name legibly, a test item that requires this and is answered 
correctly by the pupil can easily be interpreted. Similarly, if a job re- 
quires the ability to lift one-hundred-pound weights at a given rate 
of speed, the individual’s success at this can be readily interpreted 
regardless of the performance of others. As was stated in Chapter I, 
standards define the minimum degrees of excellence which society 
can accept in the performance of given tasks. Wherever such ab- 
solute standards are available, even a single test score can readily be 
interpreted simply by comparing it with the standard. These inter- 
pretations are valuable even in such simple forms as those that de- 
note a pupil’s performance as “successful” or “failing.” 

It is the absence of absolute standards in the fields of instructional 
objectives, general and special mental abilities, attitudes and in- 
terests, and other highly important aspects of pupils which makes 
norms far more widely used in the interpretation of raw scores. 
It is impossible to set up definite levels or kinds of performance in 
these aspects of pupils without relating the performance of pupils 
one to another. Thus, the spelling ability of a pupil of a given 
chronological age or school grade or level of mental ability cannot
be interpreted meaningfully as successful or failing with respect to
an absolute standard. We can only know the degree to which this 
ability is creditable or commendable by relating it to the spelling 
ability of other pupils. This is so because we can never have for 
such aspects of pupils any external yardstick independent of other 
pupils which will enable us to say that a pupil spells well or poorly, 
regardless of his status in comparison with other pupils. The im- 
possibility of defining levels of performance on any given test with- 
out relating the scores to those of other pupils results in the necessity 
of techniques for making the indicated comparisons between pupils, 
or norms. Statistical methods provide the techniques necessary for 
this purpose for interpreting the raw scores of individual pupils. 

Ranking: A First Step in Test Score Interpretation.— The difficulty 
of interpreting a single raw score has already been explained. If
instead of a single score the teacher has the scores of two pupils, the
first step in interpretation can be taken simply by ascertaining
which of the two scores is the higher. It is then probable (but not
certain!) that the pupil with the higher raw score has more of
the achievement, ability, or other aspect which the test is supposed
to measure. Similarly, as the number of raw scores increases, the 
range of interpretability also becomes greater because the teacher 
has more of a basis for interpreting each single score. A simple 
ranking, or arrangement of scores in order of magnitude, thus 
enables him to interpret each raw score as an indicator of test 
performance. The relative nature of the interpretation of test scores 
is evident from this procedure—for the impossibility of stating 
whether a given score is successful or failing there is substituted 
the possibility of determining which score is best, better, worse, 
and worst. 

Grouping and Frequency Distributions.—A fairly large number of 
test scores, say thirty or more, can usually be interpreted more 
readily when put into groups, just as a mail clerk facilitates handling 
mail by sorting letters into mail bags according to destination. If 
the teacher groups together all scores of the same or approximately 
the same magnitude, he can more easily obtain a clear picture of the 
differences and similarities between them. The result of this grouping 
is a frequency distribution. The nature of grouping and frequency 
distributions is probably best clarified in terms of the procedures 
involved. In Table 21 are given the raw scores obtained by 100 




TABLE 21. SCORES FOR 100 UNIVERSITY SOPHOMORE STUDENTS ON FORMS A
AND B OF THE PURDUE READING TEST

Student  Form A  Form B      Student  Form A  Form B
    1       62      65          51       74      72
    2      115     106          52       71      74
    3      117      98          53       88      96
    4      120     125          54       64      67
    5       84      73          55       83       ?
    6       87      78          56      119     112
    7       80      82          57      115     108
    8      110      90          58       97      76
    9       93      95          59        ?       ?
   10      100       ?          60      129     123
   11       89      91          61       68      71
   12      101      97          62       94     101
   13      103     100          63       75       ?
   14       63       ?          64      122     121
   15      104      94          65       80       ?
   16       74      74          66       87      88
   17       76      81          67      107     111
   18      115     125          68       88      87
   19      104     103          69      108     106
   20       71      74          70      113     118
   21       68      66          71      129     133
   22        ?       ?          72        ?       ?
   23      104       ?          73        ?      96
   24       86      78          74       66       ?
   25      105      98          75       89       ?
   26      113     113          76      109      79
   27       82      74          77      108      82
   28      104     102          78        ?       ?
   29        ?      94          79        ?       ?
   30        ?      99          80       88      82
   31       78      75          81       64       ?
   32      107     100          82      105      98
   33       79      81          83      111      89
   34      104      54          84      108      99
   35        ?       ?          85        ?       ?
   36       70      83          86      113      94
   37       45      51          87      108     101
   38      107     106          88       95       ?
   39       74      66          89      116     114
   40        ?     114          90      121       ?
   41       71       ?          91       95       ?
   42       90      80          92       84       ?
   43        ?       ?          93        ?       ?
   44      100      84          94      123       ?
   45      114     119          95      107       ?
   46        ?     102          96       68       ?
   47      106     107          97        ?       ?
   48      117     128          98        ?       ?
   49        ?      69          99        ?       ?
   50      105      98         100      128     107




university sophomores on Forms A and B of the Purdue Reading 
Test. Each form of the test is, of course, equivalent to a test in itself. 
These scores would be recorded in the teacher's class book in al- 
phabetical order, but in the table numbers are substituted for the 
students’ names. 

As these scores stand, their interpretation would obviously be 
difficult if not impossible without a table of norms or other inter- 
pretative aids provided by the test author. If the test were teacher- 
made, the teacher would have to supply his own norms or aids. 
Ranking the scores is the first step in this process. However, the 
procedure for grouping the scores into a frequency distribution will 
result in a form of ranking and a condensation of the data into a 
smaller, more comprehensible number of categories. We shall now 
take up in order the steps for grouping the Form A scores into a 
frequency distribution. 


1. Determine the range of the scores. The range is the difference 
between the highest and the lowest scores. Look through the 
list of scores and find the highest and lowest. Subtract the lowest 
from the highest. The highest score for Form A is 129; the lowest 
is 45. The range is the difference between 129 and 45, or 84. 

2. Determine the size of the class interval to be used. A class interval 
is the part of the total range between whose limits the appropriate 
scores are tabulated. The number of class intervals should usually 
be between 10 and 20. The class interval is preferably chosen so
that its mid-point will be an integral multiple of the size of the 
class interval. By dividing the range by 10 and then by 20 we can 
obtain some idea of the size of class interval to be used. Dividing 
84 by 10 gives 8.4 and by 20 gives 4.2. The size of the class 
interval should therefore be between 8.4 and 4.2. If intervals of 
odd numbers of units are used, the computation involved in 
obtaining subsequent statistical measures is facilitated; conse- 
quently the size of the class interval should be an odd number 
between 4.2 and 8.4. For purposes of the present illustration we 
choose a class interval of 7. 

3. List the class intervals in a column beginning with the highest.


It is appropriate at this point to discuss the limits of a single raw 
score. Any single score, say 75, is considered to represent a distance
along the continuum of scores ranging somewhere between 74 and
76. This score thus represents a distance stretching all the way 
from somewhere above 74 to somewhere below 76. At what point 
between 74 and 75, and between 75 and 76, shall we set the limits 
of 75? If we had 10,000 pupils all of whom achieved a score of 75 we 
could be sure that in achievement on this particular test they were 
not really exactly alike. A few had ability such that they just barely
made the response that gave them a score of 75 and a few had ability 
such that they just barely failed to make the response that would 
have given them a score of 76. The remainder were somewhere be- 
tween these two extremes. The distribution of these 10,000 pupils 
all of whom scored 75 would appear as in Fig. 14 if they had been 
measured with a finer scale on the same abilities. 


Fig. 14. Theoretical distribution of test scores given a value of 75
(N = 10,000; horizontal axis marked at 74.5, 75, and 75.5).


Disregarding certain technical considerations concerning the "true" 
measured abilities, we may say that 75 represents a mid-point whose 
limits are 74.5 and 75.5. In other words, 75 is the most representa- 
tive value for all those pupils with scores of 74.5 or more, and 75.5 or 
less. Thus, in setting down the column of class intervals, we might 
write for the upper limit of each interval a number which is half a 
unit greater than the highest score within that interval, and for the 
lower limit a number which is half a unit less than the lowest score
falling therein. Inspection of Table 22 shows that for Form A the 
limits of the highest interval are 129.5-136.5. This consideration is 
most relevant to the computation of various statistical measures 
which will be described below. In accordance with conventional 
procedure we have written the integral limits rather than the real 
limits in Table 22. 




TABLE 22. THE SCORES IN TABLE 21 GROUPED INTO A FREQUENCY TABLE,
ILLUSTRATING THE COMPUTATION OF THE ARITHMETIC MEAN AND THE MEDIAN

                       Form A               Form B
Class interval      f     d    fd        f     d    fd
130-136             0     6     0        1     6     6
123-129             5     5    25        5     5    25
116-122             8     4    32        5     4    20
109-115            12     3    36        7     3    21
102-108            18     2    36        9     2    18
 95-101            10     1    10       19     1    19
 88- 94            13     0     0       14     0     0
 81- 87             9    -1    -9       12    -1   -12
 74- 80             9    -2   -18       13    -2   -26
 67- 73             7    -3   -21        8    -3   -24
 60- 66             7    -4   -28        4    -4   -16
 53- 59             0    -5     0        1    -5    -5
 46- 52             1    -6    -6        1    -6    -6
 39- 45             1    -7    -7        1    -7    -7
                N = 100        50    N = 100        13

Form A

$$\text{A.M.} = \text{Guessed mean} + \frac{\Sigma fd}{N}(i) = 91 + \frac{50}{100}(7) = 94.50$$

Median = N/2 measure.
Adding frequencies from below,
1 + 1 + 7 + 7 + 9 + 9 + 13 = 47; this takes
us to 94.5, the real lower limit of the interval
containing the median; 3 more to get N/2.

$$94.5 + \frac{3}{10}(7) = 96.60, \text{ the median.}$$

Form B

Adding frequencies from below,
1 + 1 + 1 + 4 + 8 + 13 + 12 = 40; this takes
us to 87.5, the real lower limit of the interval
containing the median; 10 more to get N/2.

$$87.5 + \frac{10}{14}(7) = 92.50, \text{ the median.}$$


Table 22 shows the columns of class intervals for Form A and
Form B scores. The highest class interval for Form A is 130-136,
the next is 123-129, and so on down to 39-45. Since the ranges of
the scores for both forms were nearly the same, the same columns
of class intervals are used for both forms.


4. Tabulate each raw score in the appropriate interval by going 
through the alphabetically arranged list of scores and making a 
mark after the appropriate interval for each raw score which 
falls within it. For every fifth score in an interval make a di- 
agonal mark connecting the preceding four tabulations. After 
all the 100 scores have been tabulated, total the number of marks
in each interval and write it in the column labeled f, for fre- 
quency. The frequency in each class interval is thus the number 
of scores that fall within it. The sum of the frequencies in all 
class intervals should obviously be the same as the total number
of pupils whose test scores are being interpreted. Write this total
frequency, or N, at the bottom of the frequency column. 
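The four steps above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names are assumptions made for the example, the class-interval size is assumed to be odd, and the nine scores are the Form A scores of the first nine students in Table 21 rather than the full set of 100.

```python
import math
from collections import Counter

def candidate_widths(lowest, highest):
    """Odd class-interval sizes that give between 10 and 20 intervals."""
    score_range = highest - lowest                 # step 1: the range
    smallest = math.ceil(score_range / 20)         # step 2: range / 20 ...
    largest = math.floor(score_range / 10)         # ... up to range / 10
    return [w for w in range(smallest, largest + 1) if w % 2 == 1]

def interval_for(score, width):
    """Integral limits of the interval holding `score`.

    Each interval is centred on an integral multiple of `width` (assumed odd).
    """
    half = width // 2
    mid = ((score + half) // width) * width
    return (mid - half, mid + half)

def frequency_distribution(scores, width):
    """List of ((low, high), frequency), highest interval first (steps 3 and 4)."""
    counts = Counter(interval_for(s, width) for s in scores)
    low = interval_for(min(scores), width)[0]
    top = interval_for(max(scores), width)[0]
    return [((lo, lo + width - 1), counts.get((lo, lo + width - 1), 0))
            for lo in range(top, low - 1, -width)]

print(candidate_widths(45, 129))                   # [5, 7]

form_a_first_nine = [62, 115, 117, 120, 84, 87, 80, 110, 93]
for (lo, hi), f in frequency_distribution(form_a_first_nine, 7):
    print(f"{lo:>3}-{hi:<3}  f = {f}")
```

With an interval of 7 the lowest Form A score, 45, falls in 39-45 and the highest, 129, in 123-129, which agrees with the columns of Table 22.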


Graphic Representations of Frequency Distributions.—Often the 
nature of a frequency distribution of test scores can be better com- 
prehended when shown graphically. For this purpose three major 
types of graphs are used, the frequency polygon, the histogram, and 
the ogive. In the frequency polygon and histogram, the horizontal 
axis is used to represent the scale of magnitude of scores, and the 
vertical axis to represent the frequency or number of cases of scores 
at each of the points along the scale of magnitude. The horizontal 
axis is pointed off in the same class intervals as those used in the 
frequency distribution, beginning at the left with the lowest class 
interval and proceeding to the right with as many intervals as are 
necessary to encompass the complete range of scores. The vertical 
axis is similarly pointed off, beginning with zero at the bottom and 
proceeding upward to the number necessary to indicate the fre- 
quency in the class interval which contains the largest number of 
cases. The difference between a frequency polygon and a histogram 


Fig. 15. Frequency polygon. (Horizontal axis: score on Form A, Purdue Reading Test.)


is that in the former the frequency in each class interval is indicated 
by a point sufficiently high above the center of the class interval to 
indicate the proper frequency along the vertical scale. These points 
are then connected by straight lines, as is shown in Fig. 15. In the
histogram, on the other hand, the frequency in each class interval is
indicated by a rectangle whose base is equal to the width of the
class interval and whose altitude is sufficient to reach the point on
the vertical scale necessary for the frequency within that class in-
terval. This procedure is indicated in Fig. 16.


Fig. 16. Histogram.


The ogive differs from the frequency polygon and the histogram
mainly in that the vertical axis is pointed off in percentages, from
0 to 100, of the total frequency in the distribution. Beginning with
the class interval in which no cases fall, we place a point above the
center of the lowest class interval at the point on the vertical scale
proportional to its frequency, add to this the frequency in the next
class interval, then add to the sum of these two the frequency in the
third class interval, and so on. These cumulative frequencies are
continued until they equal the total frequency in the distribution.
The vertical scale may, of course, either be converted into cumulative
percentages of the total frequency or be left in the form of cumulative
frequencies. An ogive is illustrated in Fig. 17.
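The cumulation that underlies the ogive can be sketched as follows. The frequencies are the Form A frequencies of Table 22, listed from the lowest interval (39-45) upward; the variable names are assumptions made for this example, and no plotting is attempted.

```python
from itertools import accumulate

# Form A frequencies from Table 22, lowest class interval (39-45) first.
form_a_f = [1, 1, 0, 7, 7, 9, 9, 13, 10, 18, 12, 8, 5, 0]

# Running totals: each entry is the number of scores in that interval or below.
cumulative = list(accumulate(form_a_f))
total = cumulative[-1]

# The same running totals expressed as percentages of the total frequency.
percentages = [100 * c / total for c in cumulative]

print(cumulative)    # [1, 2, 2, 9, 16, 25, 34, 47, 57, 75, 87, 95, 100, 100]
print(percentages[-1])
```

The last cumulative frequency equals N, so the curve always ends at 100 per cent.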


MEASURES OF CENTRAL TENDENCY


After making the frequency distribution, the teacher can determine
the rank of any single score by counting down from the top
of the frequency column until the class interval containing the score
is reached. In most frequency distributions the scores are more
abundant near the middle of the range of class intervals than toward
the ends; that is, the scores tend to be “bunched” around some point


Fig. 17. Ogive or cumulative frequency curve. (Horizontal axis: score on Form A, Purdue Reading Test.)


near the middle of the range and to become relatively less and less 
frequent as we pass from this middle point toward either end of 
the range. By determining the point of “central tendency” we shall
have a convenient way of designating the position of the distribution 
of scores along the scale of scores in terms of the single most rep- 
resentative score. Several measures of central tendency, or averages, 
are used for this purpose; the most frequent are the arithmetic mean 
and the median. 




The Arithmetic Mean.—The arithmetic mean is already familiar 
to the reader, because it is the measure that is commonly referred 
to as the “average.” It is computed by adding all the scores in a 
group and dividing the sum by the number of scores. This may be 
expressed in a formula as follows: 


$$\text{A.M.} = \frac{\text{Sum of the scores}}{\text{Number of scores}} = \frac{\Sigma X}{N}$$

where X = each score in turn
      Σ = an indicated summation
      N = number of scores

When scores have been grouped into a frequency distribution, the
computation of the arithmetic mean is facilitated by a short method
which is expressed in the following working formula:

$$\text{A.M.} = \text{Guessed mean} + \frac{\Sigma fd}{N}\, i$$

where f = the number of scores in each class interval
      d = deviation, in class intervals, of each class interval from
          the guessed mean
      i = size of the class interval (in Table 22 this is 7)

The steps in applying this short formula are as follows: 

1. Inspect the frequency distribution and select a guessed mean 
at the mid-point of some class interval. This may be the mid- 
point of any interval, but in order to reduce the labor to a mini- 
mum it should be about where the calculated mean is likely to 
come. In the illustrative problem shown in Table 22 this is 91.
This class interval is thought of as extending from 87.5 to 94.5
in accordance with our previous discussion of the real limits of
scores.

2. Set the guessed mean equal to zero and call the class intervals 
(mid-points) one-step deviations from the guessed mean. Above 
the guessed mean the deviations are plus; below, they are minus. 
This step involves the assumption that the mid-point is most 
representative of all the measures in the interval. In Table 22 
this column of deviations is labeled d. 

3. Multiply the frequencies by the deviations so as to obtain the 
next column of figures, labeled fd in the table. 




4. Obtain the algebraic sum of the fd’s. This involves taking into 
account the sign of the fd products as well as their magnitude. 
In Table 22, Form A, the algebraic sum of the fd’s is 50. 

5. Divide the sum of the fd’s by N, the total number of cases. In
Table 22, Form A, this equals .5.

6. Multiply the quotient obtained by the size of the class interval.
In Table 22, Form A, this is 3.50.

7. Add the product obtained to the guessed mean. In the illustrative
problem, this is 91 + 3.50 = 94.50. It is obvious that if the guessed
mean is chosen so as to make Σfd negative, the product will be
negative and will be subtracted (algebraically added). The mean 
obtained will then be less than the guessed mean. That the arith- 
metic mean obtained by the short method will be the same re- 
gardless of the guessed mean that is chosen will be seen if the 
reader carries out these steps using another guessed mean such 
as 98. 
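The seven steps can be checked with a short Python sketch. The function name and argument layout are assumptions made for this illustration; the frequencies are those of Table 22, listed from the highest interval (mid-point 133) down to the lowest (mid-point 42), and the guessed mean is assumed to be the mid-point of one of the class intervals.

```python
def short_method_mean(midpoints, freqs, guessed_mean, width):
    """Arithmetic mean of grouped scores by the guessed-mean (short) method."""
    n = sum(freqs)
    # Step 2: deviations from the guessed mean, in whole class intervals.
    deviations = [(m - guessed_mean) // width for m in midpoints]
    # Steps 3 and 4: algebraic sum of the f * d products.
    sum_fd = sum(f * d for f, d in zip(freqs, deviations))
    # Steps 5 to 7: divide by N, multiply by the interval size, add to the guess.
    return guessed_mean + width * sum_fd / n

midpoints = list(range(133, 41, -7))          # 133, 126, ..., 49, 42
form_a = [0, 5, 8, 12, 18, 10, 13, 9, 9, 7, 7, 0, 1, 1]
form_b = [1, 5, 5, 7, 9, 19, 14, 12, 13, 8, 4, 1, 1, 1]

print(short_method_mean(midpoints, form_a, 91, 7))   # 94.5
print(short_method_mean(midpoints, form_a, 98, 7))   # 94.5, same mean from another guess
print(short_method_mean(midpoints, form_b, 91, 7))
```

Changing the guessed mean changes Σfd but not the result, which is the point made in step 7.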


The Median.—The median is defined as the point in a frequency 
distribution on either side of which lie 50 per cent of the cases.
It is, then, the N/2 measure. This average requires somewhat less
labor for its computation than does the arithmetic mean, and is
better as a measure of central tendency if it is desired not to weight
the scores in proportion to their deviation from the average. If the
scores are not grouped in a frequency distribution, the median is
obtained by arranging them in order of magnitude, or ranking
them, and counting down from the highest score until the N/2
measure is reached. If there is an even number of scores, there will
be no single middle score but rather two; the median is the arith- 
metic mean of the two. Thus, suppose we have the following 
eighteen raw scores arranged in order of magnitude: 
33   32    ?   30   29   28   27   26   26
24   22   21   20    ?    ?    ?   16   16


Since there are eighteen scores, N/2 is equal to 9. Counting down to
the ninth score from both ends of the distribution shows that 26
and 24 are the two middle scores. The arithmetic mean of these two
is 25, or (26 + 24) ÷ 2. The procedure for computing the median
of grouped scores is as follows:

1. Compute N/2, or the total number of scores divided by 2.


2. Begin at the lower end of the distribution and count the fre- 
quencies serially up to the class interval containing the median. 
In Table 22, Form A, this gives 47. 


3. Divide the number of measures required to fill out N/2 by the
frequency of the interval containing the median and multiply
the result by the value of the class interval. In Table 22, Form A,
this is (3/10) × 7 = 2.10. This step involves the assumption that
the measures in a class interval are equally distributed through-
out the interval.

4. Add this amount to the lower real limit of the class interval 
which contains the median. In Table 22, Form A, this equals 
94.5 + 2.10 = 96.60. That the median obtained by this method
will be the same regardless of from which end of the frequency
distribution the N/2 count is made can be demonstrated by the
reader by adding the frequencies from the top of the distribution
downward. This also provides a ready check on the correctness
of the computation made by starting from the lower end. To
illustrate, in Table 22, Form A, 5 + 8 + 12 + 18 = 43; seven
more scores are required to fill out N/2. The median therefore
equals 101.5 (that is, the upper real limit of the class interval
95-101) minus (7/10) × 7, or 96.60.
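The grouped-median procedure, counted from either end, can be sketched in Python as below. The function names are assumptions made for this illustration; the frequencies are those of Table 22, here listed from the lowest interval (39-45) upward, and scores are assumed to be spread evenly through each interval, as step 3 requires.

```python
def median_from_below(lower_limits, freqs, width):
    """Grouped median, counting N/2 cases upward from the lowest interval."""
    half_n = sum(freqs) / 2
    below = 0
    for low, f in zip(lower_limits, freqs):
        if f and below + f >= half_n:
            real_lower = low - 0.5                     # real lower limit
            return real_lower + (half_n - below) * width / f
        below += f

def median_from_above(lower_limits, freqs, width):
    """Grouped median, counting N/2 cases downward from the highest interval."""
    half_n = sum(freqs) / 2
    above = 0
    for low, f in zip(reversed(lower_limits), reversed(freqs)):
        if f and above + f >= half_n:
            real_upper = low + width - 0.5             # real upper limit
            return real_upper - (half_n - above) * width / f
        above += f

lower_limits = list(range(39, 131, 7))                 # 39, 46, ..., 123, 130
form_a = [1, 1, 0, 7, 7, 9, 9, 13, 10, 18, 12, 8, 5, 0]
form_b = [1, 1, 1, 4, 8, 13, 12, 14, 19, 9, 7, 5, 5, 1]

print(median_from_below(lower_limits, form_a, 7))      # about 96.6
print(median_from_above(lower_limits, form_a, 7))      # about 96.6
print(median_from_below(lower_limits, form_b, 7))      # 92.5
```

Agreement between the two functions serves as the check on the computation that step 4 describes.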


Measures of central tendency, whether means or medians, are 
useful to teachers in several ways. First, they provide a point in a 
frequency distribution by which the teacher can determine whether 
the scores of given pupils are above or below the average per- 
formance of their class or whatever other group is involved. Second, 
comparisons between groups can be made whenever two or more 
groups have taken the same test. Thus, given the average scores of 
two classes, one of which took the test a year later than the other,
the teacher can compare the average level of the first class with
that of the second. Classes taught by different teachers but evaluated 
by the same tests can also be compared as to average level of per- 
formance. 

The choice between the mean and the median as a measure of 
central tendency depends upon whether it is desired to include or 
exclude the influence of the extreme or highly atypical scores which 
sometimes occur. If some pupils obtain perfect or zero scores on a 
test, they are not fairly measured, for the test is too easy or too 
difficult to reveal their true level of performance. By using the 
median as the measure of central tendency the teacher can exclude 
the influence of these atypical scores in arriving at the typical score 
of the group. In other situations the mean is probably more desirable,
especially since it lends itself better to mathematical treatment in
the computation of further statistical measures to be described below.
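A small sketch may make the contrast concrete. The five scores below are hypothetical, invented only for this illustration; in the second list the lowest pupil's score has been replaced by a zero of the atypical kind described above.

```python
from statistics import mean, median

# Hypothetical scores for five pupils (not taken from Table 21).
typical = [61, 64, 66, 70, 73]
# The same class, except that the lowest pupil obtained a zero score.
with_zero = [0, 64, 66, 70, 73]

print(median(typical), median(with_zero))   # 66 66
print(mean(typical), mean(with_zero))       # the mean drops by more than 12 points
```

The median is unmoved by the atypical score, while the mean is pulled toward it.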


MEASURES OF VARIABILITY


After a measure of central tendency has been computed for a
frequency distribution, another feature of the distribution still re- 
mains to be described, namely, the variability of the scores or the 
degree to which they are scattered around the measure of central 
tendency. Two frequency distributions may have the same central 
tendency and yet be quite dissimilar in variability, as is shown in 
the following illustration: 


Class Interval    Class I    Class II
                     f          f
  120-129            1          0
  110-119            2          0
  100-109            4          0
   90-99             7          5
   80-89            10         10
   70-79            12         18
   60-69            15         21
   50-59            12         18
   40-49            10         10
   30-39             7          5
   20-29             4          0
   10-19             2          0
    0-9              1          0


The mean or the median of the distributions for both Class I 
and Class II is 65; yet it is evident that the scores in Class I are far 
more scattered around the mean or the median than are those in 
Class II. We can conclude that the pupils in Class I differ among 
themselves far more widely than those in Class II; that is, the 
pupils in Class I are more heterogeneous. Obviously it is desirable 
to compute a quantitative index of variability rather than to rely 
merely upon the impression gained from a visual inspection of a 
frequency distribution. 

The Range—One measure of variability is the range, or the 
difference between the highest and lowest scores. In the illustration 
above, the range for Class I is 125 - 5, or 120, and that for Class II
is 95 - 35, or 60. The range, however, is unsatisfactory as a measure
of variability because of its complete dependence upon the extreme
scores in a distribution. One atypical pupil in Class II could have 
greatly increased the range if his score had fallen in the highest or 
lowest class interval, but the scatter of the majority of the pupils 
would not have been as great as the range indicated. Because of this 
disadvantage, the range is not a dependable measure of variability. 
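The dependence of the range on extreme scores can be sketched as follows. The function name is an assumption made for this example, and the interval mid-points are the nominal values used in the text (125 for 120-129, 5 for 0-9, and so on); the "atypical" distribution adds one imagined pupil to the highest interval of Class II.

```python
def score_range(mid_points, freqs):
    """Difference between the highest and lowest occupied interval mid-points."""
    occupied = [m for m, f in zip(mid_points, freqs) if f > 0]
    return max(occupied) - min(occupied)

mid_points = list(range(125, 0, -10))             # 125, 115, ..., 15, 5
class_i = [1, 2, 4, 7, 10, 12, 15, 12, 10, 7, 4, 2, 1]
class_ii = [0, 0, 0, 5, 10, 18, 21, 18, 10, 5, 0, 0, 0]

print(score_range(mid_points, class_i))           # 120
print(score_range(mid_points, class_ii))          # 60

# One hypothetical atypical pupil scoring in the 120-129 interval.
class_ii_atypical = [1] + class_ii[1:]
print(score_range(mid_points, class_ii_atypical)) # 90
```

A single added score raises the range of Class II by half, although the scatter of the other pupils is unchanged.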

The Quartile Deviation.—Another measure of variability that
is frequently used is the quartile deviation, or semi-interquartile 
range. It is defined by the following equation: 


$$Q = \frac{\text{upper quartile} - \text{lower quartile}}{2} = \frac{Q_u - Q_l}{2}$$


That is, Q is the difference between the two quartile points divided 
by 2. The quartile points are measures analogous to the median. The 
upper quartile is the point in a frequency distribution above which 
lie 25 per cent of the cases. Similarly, the lower quartile is the point
in the distribution below which lie 25 per cent of the cases. That
is, the quartile points are N/4 measures taken from the respective
ends of the distribution.

TABLE 23. ILLUSTRATING THE COMPUTATION OF THE STANDARD DEVIATION (σ),
PROBABLE ERROR (P.E.) AND QUARTILE DEVIATION (Q)

                         Form A                     Form B
Class interval      f     d    fd   fd²        f     d    fd   fd²
130-136             0     6     0     0        1     6     6    36
123-129             5     5    25   125        5     5    25   125
116-122             8     4    32   128        5     4    20    80
109-115            12     3    36   108        7     3    21    63
102-108            18     2    36    72        9     2    18    36
 95-101            10     1    10    10       19     1    19    19
 88- 94            13     0     0     0       14     0     0     0
 81- 87             9    -1    -9     9       12    -1   -12    12
 74- 80             9    -2   -18    36       13    -2   -26    52
 67- 73             7    -3   -21    63        8    -3   -24    72
 60- 66             7    -4   -28   112        4    -4   -16    64
 53- 59             0    -5     0     0        1    -5    -5    25
 46- 52             1    -6    -6    36        1    -6    -6    36
 39- 45             1    -7    -7    49        1    -7    -7    49
                N = 100        50   748    N = 100        13   669

Form A

$$\sigma = 7\sqrt{\frac{748}{100} - \left(\frac{50}{100}\right)^2} = 18.82$$

P.E. = .6745σ = 12.69
Q = (108.5 - 80.5)/2 = 14.0

Form B

$$\sigma = 7\sqrt{\frac{669}{100} - \left(\frac{13}{100}\right)^2} = 18.08$$

P.E. = .6745σ = 12.19
Q = (103.06 - 78.89)/2 = 12.09

In Table 23 we find for Form A that the
upper quartile, Q_u, equals 108.5, and the lower quartile, Q_l, equals
80.5; therefore the quartile deviation, Q, equals (108.5 - 80.5)/2, or
14.0. For Form B, Q_u equals 103.06, and Q_l equals 78.89; therefore
Q equals (103.06 - 78.89)/2, or 12.09. The quartile deviation is
probably the best measure of variability in the kind of distribution
for which the median is the best measure of central tendency, that 
is, in distributions containing relatively few cases, such as the 
number of pupils in a single average-size classroom, or in distribu- 
tions containing a few atypical scores, such as zero or perfect. 
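The quartile points are found by the same interpolation as the median, with N/4 and 3N/4 in place of N/2. The sketch below is an illustration only; the function names are assumptions, and the frequencies are those of Table 23 listed from the lowest interval (39-45) upward.

```python
def point_below_which(fraction, lower_limits, freqs, width):
    """Score below which the given fraction of the cases lie (grouped data)."""
    target = fraction * sum(freqs)
    below = 0
    for low, f in zip(lower_limits, freqs):
        if f and below + f >= target:
            # Interpolate upward from the real lower limit of this interval.
            return (low - 0.5) + (target - below) * width / f
        below += f

def quartile_deviation(lower_limits, freqs, width):
    """Semi-interquartile range: half the distance between the quartile points."""
    q_lower = point_below_which(0.25, lower_limits, freqs, width)
    q_upper = point_below_which(0.75, lower_limits, freqs, width)
    return (q_upper - q_lower) / 2

lower_limits = list(range(39, 131, 7))                 # 39, 46, ..., 123, 130
form_a = [1, 1, 0, 7, 7, 9, 9, 13, 10, 18, 12, 8, 5, 0]
form_b = [1, 1, 1, 4, 8, 13, 12, 14, 19, 9, 7, 5, 5, 1]

print(point_below_which(0.25, lower_limits, form_a, 7))   # 80.5
print(point_below_which(0.75, lower_limits, form_a, 7))   # 108.5
print(quartile_deviation(lower_limits, form_a, 7))        # 14.0
print(quartile_deviation(lower_limits, form_b, 7))
```

For Form B the same interpolation on these frequencies places the upper quartile two-ninths of the way through the 102-108 interval and the lower quartile ten-thirteenths of the way through the 74-80 interval.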

The Standard Deviation.—The standard deviation is the most 
commonly used measure of variability. Defined verbally, it equals 
the square root of the mean of the squared deviations of the scores 
from the arithmetic mean. This definition is better comprehended 
in terms of the following fundamental formula: 


$$\text{Standard deviation} = \text{sigma} = \sigma = \sqrt{\frac{\Sigma d^2}{N}}$$

where d = deviation of each score from the
          arithmetic mean
As in the case of the arithmetic mean, however, there is a shorter 
method than obtaining the actual difference between each measure 
and the mean of the distribution, squaring these differences,
and summing the squares of the differences. The working formula
for the short method is: 


$$\sigma = i\sqrt{\frac{\Sigma fd^2}{N} - \left(\frac{\Sigma fd}{N}\right)^2}$$


The notation here is the same as for the formula for the arithmetic
mean. The computation of the standard deviation (sigma) is shown 
in Table 23 for the data on Forms A and B of the Purdue Reading 
Test. A summary of the steps in computing it follows: 


1. Make a frequency distribution of the test scores. 

2. As in obtaining the arithmetic mean from scores grouped in
a frequency distribution, assume a mean at the mid-point of
some class interval, preferably near the middle of the distribu-
tion, and lay off deviations, in steps of one, above and below
this mid-point. Note, however, that it is not necessary to com-
pute the actual numerical value of the mean.

3. Obtain the fd’s by multiplying each frequency by the deviation 
of the mid-point of the class interval from the mean. 

4. Obtain the fd²’s by multiplying each fd once again by the cor-
responding d.

5. Add the fd’s, divide by N, and square the result.

6. Sum the fd²’s and divide by N.

7. Subtract the (Σfd/N)² obtained in Step 5 above from the result
obtained in Step 6 above.
8. Obtain the square root of the difference obtained in Step 7 above. 
9. Multiply the obtained square root by the class interval. The re- 
sulting number is the standard deviation of the distribution of 
scores. 
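The nine steps can be followed in a short Python sketch. The function name is an assumption made for this example; the frequencies and one-step deviations are those of Table 23, listed from the highest interval (d = 6) down to the lowest (d = -7), with the assumed mean at the mid-point of the 88-94 interval.

```python
import math

def short_method_sd(freqs, deviations, width):
    """Standard deviation of grouped scores by the assumed-mean (short) method."""
    n = sum(freqs)
    sum_fd = sum(f * d for f, d in zip(freqs, deviations))        # step 3
    sum_fd2 = sum(f * d * d for f, d in zip(freqs, deviations))   # step 4
    correction = (sum_fd / n) ** 2                                # step 5
    mean_square = sum_fd2 / n                                     # step 6
    # Steps 7 to 9: subtract, take the square root, multiply by the interval.
    return width * math.sqrt(mean_square - correction)

deviations = list(range(6, -8, -1))                    # 6, 5, ..., -6, -7
form_a = [0, 5, 8, 12, 18, 10, 13, 9, 9, 7, 7, 0, 1, 1]
form_b = [1, 5, 5, 7, 9, 19, 14, 12, 13, 8, 4, 1, 1, 1]

print(round(short_method_sd(form_a, deviations, 7), 2))   # 18.82
print(round(short_method_sd(form_b, deviations, 7), 2))   # 18.08
```

As step 2 notes, the numerical value of the assumed mean never enters the computation; only the deviations in class-interval steps do.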


In the next section the uses and interpretation of the standard 
deviation are discussed. 


THE NORMAL CURVE
Interpreting the Standard Deviation.—In order to interpret a 
standard deviation, the reader should have at this point an intro- 
duction to the ideal, theoretical, mathematically defined frequency 
distribution known as the normal curve of error, the normal prob-
ability curve, or normal distribution. The normal curve is a mathe-
matical ideal in the sense that it is a product of pure reason rather
than of experimental results. Its importance in statistics is due to 
the fact that it has been found to coincide closely with the actual 
distribution of certain types of data. Wherever the magnitude or 
frequency of a given phenomenon is determined by a large number 
of factors none of which has a disproportionately great influence in 
the determination and all of which act independently of one another, 
the shape of the distribution of the frequencies or magnitudes of the 
phenomenon will tend to approach the normal curve. 

An example of such a phenomenon is the frequency with which 
varying numbers of heads will be obtained if ten coins are tossed 
simultaneously many times. The range of the frequency distribu-
tion of the number of heads is from 0 to 10 and the mean is 5.
That is, 5 heads out of 10 will be obtained more often than any
other number and the frequencies with which different numbers
of heads will be obtained will decrease as we proceed from 5 upward
to 10 or from 5 downward to 0. The distribution will be bell-shaped,
as shown in Fig. 18. 


Fig. 18. Theoretical frequency distribution of number of heads
obtained in tossing 10 coins. (Horizontal axis: number of heads;
vertical axis: relative frequency.)




Human traits may or may not be distributed in a form similar to 
the normal curve. The form of the frequency distribution of a 
human trait depends not only on the trait but also on the way it is 
measured and on the sample of persons included in the frequency 
distribution. Thus human skin or eye color is not distributed 
normally because people fall into fairly distinct groups according to 
these traits. Human height is a more continuous trait, people being 
less distinctly grouped according to height. 

Similarly, one test of mental ability of a given group of pupils 
may yield scores which fall into a bell-shaped distribution. But an 
easier test may yield scores that pile up toward the high end of the 
range of scores, or in a negatively skewed distribution, as shown in 
Fig. 19. A too difficult test may yield scores that fall into a positively 
skewed distribution. 


Fig. 19.—Negatively skewed (left) and positively skewed (right) distributions
of test scores. (Vertical axes: frequency; horizontal axes: score on test.)


The group included in a distribution of human heights might 
include half Chinese and half Scandinavians. The resulting fre- 
quency curve would have two peaks, i.e., be bimodal, with one peak 
at the average Scandinavian height and the other at the average 
Chinese height, as in Fig. 20. 


Fig. 20.—Bimodal distribution. (Vertical axis: frequency; horizontal axis:
score on test.)




It is therefore evident that human traits are not necessarily 
distributed according to any fixed law, such as the normal curve. 
Consequently the normal curve is not useful in the immediate sense 
of relevance to the aspects or traits of pupils with which this volume 
is concerned. But for another class of data the normal curve is 
descriptive of the form of distributions. This class includes the 
various statistical measures which are discussed in this chapter, such 
as arithmetic means, medians, proportions, standard deviations, 
semi-interquartile ranges, and differences between means, standard 
deviations, proportions, etc. As will be seen below, if we know that 
a given set of measures is distributed normally and if we know the 
standard deviation of the set of measures, we can draw conclusions 
concerning the frequency with which measures of various magni- 
tudes will occur. 

Area Relationships.—The property of the normal curve which is 
probably most frequently used in the interpretation of frequency 
distributions is the following: The area under the curve included 
between a vertical line (or ordinate) erected at the arithmetic mean 
of the curve and a vertical line (ordinate) erected at any distance 
from the arithmetic mean, where the distance is expressed as a 
multiple of the standard deviation, is always the same proportion 
of the total area under the curve. Thus, the area included between
the ordinate at the mean and an ordinate erected at a distance of one
standard deviation from the mean, on either side, will always include
34.13 per cent of the total area under the normal curve. This is shown
in Fig. 21.


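Fig. 21.—Relationship between sigma distances of ordinates from mean and
area included between ordinates. (Ordinates are marked at 1σ, 2σ, and 3σ
on either side of the mean, M; the areas marked between the mean and the
ordinates at 2σ and 3σ are 47.72% and 49.87%.)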


Similarly, if the difference between the mean ordinate and another
ordinate is equal to .6745 standard deviation, the area between the
two ordinates will equal 25 per cent of the total area under the
normal curve. This distance, .6745 standard deviation, is known as
the probable error; it is evident that 50 per cent of the total area
under a normal curve will be included between ordinates erected
on either side of the mean ordinate at a distance of one probable
error from it. In a normal distribution, or any perfectly symmetrical
distribution, the quartile deviation equals the probable error. The
proportions of cases included between ordinates erected at various
other distances from the mean, when the distances are expressed as
multiples of the standard deviation, are shown in Table 24. The
reader should practice using this table by verifying the figures for
various values given in this chapter.

Since the area included under any part of a curve is analogous to
the number of cases in the frequency distribution included between
the points on the scale at which the ordinates bounding the area
are erected, the various proportions of the area may be interpreted
directly as proportions of the total number of cases in a frequency
distribution. It is upon this equivalence of area to number of cases
and upon the relationship between the standard deviation and the
area under a curve that the significance of the standard deviation as
an expression of the variability of a frequency distribution depends.
If the teacher knows the standard deviation of a frequency distribution
and that the shape of the latter is a fair approximation to the
normal curve, he can immediately draw conclusions concerning the
proportion of the total number of cases included between points
which are one standard deviation greater and one standard deviation
less than the arithmetic mean of the distribution.

Thus between plus and minus one standard deviation from the
mean will be included approximately two-thirds (68.26 per cent)
of the total number of cases; within plus and minus one probable
error from the mean will be included approximately 50 per cent
of the cases. Similarly within plus and minus two standard deviations
from the mean will be included roughly 95 per cent of the
cases, while between plus and minus three standard deviations from
the mean will be included practically all, or 99.74 per cent, of the
cases. These interpretations of the standard deviation are valid, of
course, only to the degree that the frequency distribution obtained
approximates the normal curve.
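
The area relationships quoted above can be checked numerically. The sketch below is one way to do so in Python, using the error function from the standard library; the function name `area_mean_to` is illustrative. It assumes an exactly normal distribution, so it says nothing about how closely any obtained distribution approximates the curve.

```python
from math import erf, sqrt

def area_mean_to(z):
    """Per cent of the total area lying between the mean ordinate and an
    ordinate erected z standard deviations from the mean."""
    return 100 * 0.5 * erf(abs(z) / sqrt(2))

for z in (0.6745, 1.0, 2.0, 3.0):
    one_side = area_mean_to(z)
    # Doubling gives the area between ordinates at plus and minus z.
    print(f"{z:6.4f} sigma: {one_side:5.2f}% one side, {2 * one_side:5.2f}% both sides")
```

Computed directly, the area within plus and minus three standard deviations comes to 99.73 per cent; the figure of 99.74 quoted above results from doubling the rounded one-sided value of 49.87.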




TABLE 24.—PERCENTAGE OF TOTAL AREA UNDER THE NORMAL CURVE BETWEEN
MEAN ORDINATE AND ORDINATE AT ANY GIVEN SIGMA DISTANCE FROM THE MEAN

 x/σ      .02      .03      .05      .08
 0.0    00.80    01.20    01.99    03.19
 0.1    04.78    05.17    05.96    07.14
 0.2    08.71    09.10    09.87    11.03
 0.3    12.55    12.93    13.68    14.80
 0.4    16.28    16.64    17.36    18.44

 0.5    19.85    20.19    20.88    21.90
 0.6    23.24    23.57    24.22    25.17
 0.7    26.42    26.73    27.34    28.23
 0.8    29.39    29.67    30.23    31.06
 0.9    32.12    32.38    32.90    33.65

 1.0    34.61    34.85    35.31    35.99
 1.1    36.86    37.08    37.49    38.10
 1.2    38.88    39.07    39.44    39.97
 1.3    40.66    40.82    41.15    41.62
 1.4    42.22    42.36    42.65    43.06

 1.5    43.57    43.70    43.94    44.29
 1.6    44.74    44.84    45.05    45.35
 1.7    45.73    45.82    45.99    46.25
 1.8    46.56    46.64    46.78    46.99
 1.9    47.26    47.32    47.44    47.61

 2.0    47.83    47.88    47.98    48.12
 2.1    48.30    48.34    48.42    48.54
 2.2    48.68    48.71    48.78    48.87
 2.3    48.98    49.01    49.06    49.13
 2.4    49.22    49.25    49.29    49.34

 2.5    49.41    49.43    49.46    49.51
 2.6    49.56    49.57    49.60    49.63
 2.7    49.67    49.68    49.70    49.73
 2.8    49.76    49.77    49.78    49.80
 2.9    49.82    49.83    49.84    49.86




SUMMARY 


Since single raw scores are relatively meaningless, test scores must 
be related to one another when interpreted. Ranking and frequency 
distributions furnish a beginning in this direction. Frequency poly- 
gons, histograms, and ogives serve to increase the interpretability 
of frequency distributions. Arithmetic means and medians furnish 
measures of central tendency or points most representative of fre- 
quency distributions. The range, quartile deviation, and standard 
deviation may be used to measure the spread or scatter of the scores 
around the measure of central tendency. The normal curve and 
its area relationships are essential to the full interpretation of the 
standard deviation.


For questions and bibliography, see the end of Chapter XXII.


CHAPTER XXII 


Interpreting Scores (Continued) 




MAKING TEST SCORES COMPARABLE


We have considered the meaninglessness of a single raw score.
This extends not only to the interpretation of raw scores on a single 
test or the interpretation of a pupil’s standing on a single test, but 
also to the interpretation of the relative standing of a single pupil 
on two or more tests. That is, raw scores from different tests are 
not directly comparable. Suppose, for example, that Ray Brown, a 
ninth-grade pupil, made the following scores on four different 
evaluation tests given at the end of the semester: 


Algebra ........................  52
Problems of Democracy .......... 116
Mechanics of Written English ... 163
Attitude Toward High School ....  84


Obviously, these raw scores tell us nothing except that he achieved 
something on each measurement; they do not tell how much. The 
scores are not on a common scale. We need to know his relative 
standing, that is, his standing in comparison with a defined group, 
such as all ninth-grade pupils in his school or in the county, state, 
or nation. We cannot tell from the raw scores alone whether Ray 
Brown was better in algebra than in Problems of Democracy, and 
so on. In order to compare his standing in the four different evalua- 
tions, we need to take into account not only whether he was above 
or below the average of a given group in each test, but also the 
degree to which he was above or below average. If on one of the 
tests, say algebra, the pupils’ scores did not differ much from one 
another, their standard deviation would be small. Then in order to 
achieve a relatively high standing in the algebra test, Ray would 
not need to exceed the average of his class by as many raw score 
units as would be necessary if the standard deviation were large. 
Obviously, then, both a measure of central tendency and a measure 
of variability must be taken into account in determining the 
relative standing within a group of any pupil with a given raw 
score. Of the many commonly used methods for making scores on 
different tests comparable in terms of both central tendency and 
variability, we shall discuss standard scores, T-scores, and percentile 
scores. 

Standard Scores.—A standard score, or z-score, is a score defined in 
terms of its deviation from the arithmetic mean in standard devia- 
tion units. The formula for z-scores is: 


\[ z = \frac{X - M}{\sigma} \]

z = standard score
X = raw score
M = arithmetic mean of raw scores
σ = standard deviation of raw scores


It should be noted that all raw scores below the arithmetic mean 
will thus be converted into negative z-scores and that a raw score 
equal to the arithmetic mean will be equivalent to the z-score of 
zero. A raw score one standard deviation above or below the arith- 
metic mean will be equivalent to a z-score of plus one or minus 
one respectively. Thus, in terms of z-scores, the standard deviation 
becomes the unit distance of a frequency distribution. The useful- 
ness of these scores may be illustrated by comparing Ray Brown's 
relative standing on each of the four evaluation devices, as well as 
his composite score, by obtaining the required additional informa- 
tion, the arithmetic mean and the standard deviation of each test, 
and substituting them along with his raw scores in the formula for 
z-scores.

                                 X (score)      M       X − M       σ         z
Algebra ........................     52        47.92    + 4.08    10.00    + 0.41
Problems of Democracy ..........    116       120.26    − 4.26    21.02    − 0.20
Mechanics of Written English ...    163       163.00      0.00    23.19      0.00
Attitude Toward High School ....     84        72.00    +12.00    24.00    + 0.50
Combined z-score (arithmetic mean) ....................................    + 0.18

It is now readily seen that Ray is well above average in algebra
and in Attitude Toward High School, slightly below average in
Problems of Democracy, and exactly average in Mechanics of Written
English. His combined score places him slightly above average.
Every pupil's relative standing can be found in the same manner. 
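
The computation of Ray Brown's z-scores can be sketched in a few lines of Python. The raw scores are those listed earlier; the means and standard deviations are taken from the table of z-scores, with the Attitude Toward High School entries read here as M = 72.00 and σ = 24.00, which is an assumption because those figures are hard to make out in this copy. The names `z_score` and `tests` are illustrative.

```python
def z_score(raw, mean, sd):
    """Deviation of a raw score from the mean, in standard deviation units."""
    return (raw - mean) / sd

# (raw score, arithmetic mean, standard deviation) for each evaluation device.
tests = {
    "Algebra": (52, 47.92, 10.00),
    "Problems of Democracy": (116, 120.26, 21.02),
    "Mechanics of Written English": (163, 163.00, 23.19),
    "Attitude Toward High School": (84, 72.00, 24.00),  # assumed reading
}

z = {name: z_score(*values) for name, values in tests.items()}
combined = sum(z.values()) / len(z)  # arithmetic mean of the four z-scores

for name, value in z.items():
    print(f"{name:30s} {value:+.2f}")
print(f"{'Combined z-score':30s} {combined:+.2f}")
```

Because each z-score is already expressed in standard deviation units, the four values can be averaged directly even though the raw scores came from scales of very different lengths.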

T-Scores.—T-scores serve the same purpose as z-scores and are 
based on the same principle; they have the advantage, however, of 
being always positive and expressed in larger units, thus removing 
the necessity of dealing to such a great extent with decimal fractions. 
These advantages are obtained by converting the mean of the 
distribution to 50 and the standard deviation to 10 by means of the
following formula:

\[ T\text{-score} = 50 + \frac{10(X - M)}{\sigma} \]


It is readily seen that z-scores are convertible into T-scores simply 
by multiplying the z-score by 10 and adding the product to 50. A 
T-score of 60 means a score one standard deviation above the mean;
a T-score of 70 lies two standard deviations above the mean, and so 
on. T-scores of 40, 30, and 20 similarly indicate scores at one, two, 
and three standard deviations below the mean respectively. The 
changes from z-scores to T-scores are illustrated in the following, 
based on Ray Brown’s z-scores given above: 


                                   z-score    T-score (= 50 + 10z)
Algebra ........................    + 0.41         54
Problems of Democracy ..........    − 0.20         48
Mechanics of Written English ...      0.00         50
Attitude Toward High School ....    + 0.50         55


From this it can be seen that positive z-scores become T-scores 
greater than 50, negative z-scores become T-scores less than 50, and 
a zero z-score is exactly equal to 50. 
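
The conversion from z-scores to T-scores is a single multiplication and addition. A minimal sketch in Python, using the four z-scores given for Ray Brown and rounding to the nearest whole number as the table does; the function name `t_from_z` is illustrative:

```python
def t_from_z(z):
    """T-score: mean set at 50 and standard deviation at 10."""
    return 50 + 10 * z

ray_z = {
    "Algebra": 0.41,
    "Problems of Democracy": -0.20,
    "Mechanics of Written English": 0.00,
    "Attitude Toward High School": 0.50,
}

ray_t = {name: round(t_from_z(z)) for name, z in ray_z.items()}
for name, t in ray_t.items():
    print(f"{name:30s} {t}")
```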

Percentiles.—A third method of making test scores comparable is 
through the computation of percentile rank; this indicates the per- 
centage of all the scores in a group, or frequency distribution, which 
are exceeded by a given raw score. Thus, if a raw score is equivalent 
to a percentile of 50, it exceeds 50 per cent of the scores in a group. 
A raw score equivalent to the 99th percentile exceeds 99 per cent of all
the scores in a group while a raw score equivalent to the 1st percentile
exceeds only 1 per cent of the scores in a group. The 50th
percentile is thus equal to the median; the 75th percentile equals
the upper quartile, Q₃, and the 25th percentile equals the lower
quartile, Q₁. Percentile equivalents for each raw score are computed
by arranging the raw scores in order of decreasing magnitude and 
determining for each one the number of scores lying below it. Each 
of these numbers is then divided by the total number of scores, and 
the quotient is multiplied by 100. When the scores are grouped into 
a frequency distribution and all within a given class interval are 
assumed to have the same value, we determine for each class interval 
the total frequency of the scores in the intervals below it. 
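
For ungrouped scores the procedure just described can be sketched as follows. The ten raw scores are hypothetical and serve only to illustrate the arithmetic; the function name `percentile_rank` is likewise illustrative.

```python
def percentile_rank(score, scores):
    """Per cent of the scores in the group that lie below the given score."""
    below = sum(1 for s in scores if s < score)
    return 100 * below / len(scores)

# Hypothetical raw scores for a group of ten pupils.
group = [62, 74, 80, 84, 87, 93, 100, 104, 110, 120]

ranks = {score: percentile_rank(score, group) for score in group}
for score in sorted(group, reverse=True):  # decreasing magnitude, as in the text
    print(f"raw score {score:3d}: percentile {ranks[score]:.0f}")
```

Under this definition the lowest score in a group has a percentile of zero, since no score lies below it, and no score can reach 100.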

For such purposes it is frequently desirable to make a cumulative 
frequency distribution. This is done by adding the frequencies in 
each class interval to each other, beginning with the class interval
at the bottom. The frequency in the bottom interval is left as it is. 
The cumulative frequency in the second class interval equals the 
sum of the bottom and the second frequencies. The cumulative 
frequency in the third interval is equal to the sum of its own fre- 
quency and the cumulative frequency in the second interval, and so 
on. This procedure is illustrated in the following, based on the 
figures used in Table 22, Form A of the Purdue Reading Test. 
The ogive for this cumulative frequency distribution is shown in
Fig. 17.


Class Interval    Frequency (f)    Cumulative Frequency
   130-136              —
   123-129              5                  100
   116-122              8                   95
   109-115             12                   87
   102-108             18                   75
    95-101             10                   57
    88-94              13                   47
    81-87               9                   34
    74-80               9                   25
    67-73               7                   16
    60-66               7                    9
53-59 2 2 
46-52 I 2 
39745 1 z 


^ 
o 
o 
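
The running addition that produces a cumulative frequency distribution can be sketched as below. The frequencies are those of the class intervals from 60-66 upward; the interval labels 81-87 and 88-94 and the frequency 13 are inferred from the seven-point interval pattern and from the differences between successive cumulative frequencies. The two cases falling below 60 are carried in as a starting count, since their division among the lowest intervals is not legible in this copy. All names are illustrative.

```python
intervals = ["60-66", "67-73", "74-80", "81-87", "88-94",
             "95-101", "102-108", "109-115", "116-122", "123-129"]
frequencies = [7, 7, 9, 9, 13, 10, 18, 12, 8, 5]

cases_below_60 = 2  # carried in from the intervals below 60-66
total = cases_below_60 + sum(frequencies)

running = cases_below_60
cumulative = []
for f in frequencies:
    running += f  # each interval adds its own frequency to the total so far
    cumulative.append(running)

for label, f, cf in zip(intervals, frequencies, cumulative):
    below = cf - f  # total frequency in the intervals below this one
    print(f"{label:>8s}  f = {f:2d}  cum. f = {cf:3d}  percentile of interval = {100 * below / total:.0f}")
```

The last column applies the rule for grouped data stated earlier: for each class interval, the total frequency of the intervals below it is divided by the total number of scores and multiplied by 100.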




If it is justifiable to assume that the frequency distribution ob- 
tained closely approximates a normal curve, z-scores, T-scores, and 
percentile ranks can all be converted one into another. A z-score of 
1, meaning a raw score falling at one standard deviation above the 
mean, is then known to be equivalent to the 84th percentile, as is 
also a T-score of 60.
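
Where the assumption of normality is justified, the conversions can be sketched with the error function from the Python standard library. The function names are illustrative, and the results hold only to the degree that the obtained distribution really approximates the normal curve.

```python
from math import erf, sqrt

def percentile_from_z(z):
    """Per cent of a normal distribution lying below a given z-score."""
    return 100 * 0.5 * (1 + erf(z / sqrt(2)))

def z_from_t(t):
    """Undo the T-score conversion (mean 50, standard deviation 10)."""
    return (t - 50) / 10

for t in (30, 40, 50, 60, 70):
    z = z_from_t(t)
    print(f"T = {t}  z = {z:+.1f}  percentile = {percentile_from_z(z):.0f}")
```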


MEASURES OF RELATIONSHIP—CORRELATION


Scattered throughout this book are references to relationships be- 
tween various aspects of pupils, between various methods of evaluat- 
ing these aspects, between successive applications of the same evalua- 
tion device, and others. These relationships have been variously 
named according to the kind of measurements involved. The rela- 
tionship between two applications of the same device obtained either 
by splitting the device into equivalent halves, by retesting with the 
same device, or by applying two equivalent forms of the same 
device, has been considered under the heading of reliability. Rela- 
tionships between measurements obtained with an evaluation device 
and some criterion or external standard of the same aspect of pupils 
or the degree to which scores obtained on the test agree with else- 
where-obtained ideas of the degree of that aspect in a pupil, have 
been considered under the heading of validity. Relationships be- 
tween various aspects of pupils, such as general mental ability and 
scholastic achievement, socio-economic environment and emotional 
adjustment, have been considered in the discussions of the aspects 
of pupils which should be evaluated. 

The classroom teacher may often have occasion to determine the 
relationships between measured aspects of pupils both for the light 
they throw on the nature of the pupils, their learning processes, and 
reactions to teaching procedures, and for the evidence they provide 
concerning the nature of the evaluation devices themselves in terms 
of reliability, validity, and overlapping functions. In the present 
section we shall consider the statistical techniques by which the 
teacher can determine the relationships between the scores obtained 
with evaluation devices and any other source of quantitative de- 
scriptions of pupils. An understanding of these techniques will 
prove useful not only in working with data locally obtained but
also in understanding the literature of experimental education and 
the nature of externally made, standardized evaluation devices. 

The Closeness and Direction of Relationships—The relationships 
with which we are here concerned are those between quantitative 
variables, that is, between magnitudes or degrees of different aspects 
of people or things. These relationships vary in closeness and in 
direction. The closeness of a relationship is the degree to which one 
variable changes as the other changes. For example, as the length 
of one of its sides increases, the area of a square also increases according
to a completely fixed, close relationship. Or as temperature
increases, the height of the mercury in a thermometer also increases, 
following the temperature very closely. On the other hand, the 
body weight and general mental ability of a group of people are 
not closely related; that is, intelligence does not increase as body 
weight increases. An intermediate degree is illustrated by the 
relationship usually found between a pupil’s scores on mathematical 
ability and on verbal ability tests; while some pupils excel in both 
tests and others do poorly in both, there are many pupils who differ 
greatly in their standing on the two kinds of ability. 

The direction of a relationship refers to whether one variable in- 
creases as the other variable increases. A positive relationship is 
one in which a high degree of one variable is accompanied by a 
high degree of the other. If the temperature is high, the column 
of mercury will be high; or if a pupil's score on one test of mental 
ability is high, it will be high on another test of mental ability.
A negative relationship is one in which low scores on one variable
are accompanied by high scores on another variable. Thus, the
relationship between pupils’ scores on a test of mental ability and 
the number of their failures in school subjects is negative. With 
some variables the direction of relationship may differ at different 
stages or levels of one of the variables. Such combinations of both 
positive and negative relationships are called curvilinear. Thus we 
know that physical strength increases with chronological age up to 
a certain level and then decreases as persons grow older and approach
old age. Similarly, the length of a shadow decreases as the
day grows older during the first half and increases during the later
half of the day.

Data Required for Determining Relationships.—How can the 
closeness and direction of the relationship between two variables be
determined? The first requirement is that the variables themselves
be obtained. Measurements of the two must be paired in some way.
Usually the pairing is done on the basis of individual persons, as
when each pupil in a group is measured by two tests. In whatever
way the variables are paired, the net result should be three columns
of data, the first denoting the pupil, the second denoting his score
on one of the variables, and the third denoting his score on the
other variable. The list of scores on Form A and Form B of the
Purdue Reading Test in Table 25 shows a list of paired scores for
25 students.

Rank Correlation.—After the paired data have been obtained, 
several methods may be used to ascertain the closeness and direction 
of the relationship between the two series of measures. If the data
are in the form of ranks rather than raw scores or derived scores, the
Spearman rank-difference coefficient of correlation (rho) should be
used.

The formula for this coefficient is 

\[ \text{rho} = 1 - \frac{6 \sum D^2}{N(N^2 - 1)} \]

where D = difference between a pair of ranks
      N = number of pairs of ranks

We may illustrate the use of this formula by applying it to the 
ranks of ten pupils on two different tests. These data are shown
below, together with the computations necessary for the applica- 
tion of the rank-difference method. 


            Rank on    Rank on
Pupil       Test I     Test II      D      D²
  A            6          5         1       1
  B            5
  C
  D
  E
  F            8          6         2       4
  G           10         10         0       0
  H            2
  I
  J            1

\[ \text{rho} = 1 - \frac{6 \sum D^2}{10(99)} \]



TABLE 25.—THE COMPUTATION OF r FROM AN ASSUMED MEAN OF ZERO

           Score on      Score on
Student   Form A (X)    Form B (Y)       X²          Y²          XY
    1         62            65          3,844       4,225       4,030
    2        115           106         13,225      11,236      12,190
    3        117            98         13,689       9,604      11,466
    4        120           125         14,400      15,625      15,000
    5         84            73          7,056       5,329       6,132
    6         87            78          7,569       6,084       6,786
    7         80            82          6,400       6,724       6,560
    8        110            90         12,100       8,100       9,900
    9         93            95          8,649       9,025       8,835
   10        100            96         10,000       9,216       9,600
   11         89            91          7,921       8,281       8,099
   12        101            97         10,201       9,409       9,797
   13        103           100         10,609      10,000      10,300
   14         63            71          3,969       5,041       4,473
   15        104            94         10,816       8,836       9,776
   16         74            74          5,476       5,476       5,476
   17         76            81          5,776       6,561       6,156
   18        115           125         13,225      15,625      14,375
   19        104           103         10,816      10,609      10,712
   20         71            74          5,041       5,476       5,254
   21         68            66          4,624       4,356       4,488
   22         95            85          9,025       7,225       8,075
   23        104            92         10,816       8,464       9,568
   24         86            78          7,396       6,084       6,708
   25        105            98         11,025       9,604      10,290
    Σ      2,326         2,237        223,668     206,215     214,046
    M      93.04         89.48

\[ r = \frac{\sum XY - N M_x M_y}{\sqrt{\sum X^2 - N M_x^2}\,\sqrt{\sum Y^2 - N M_y^2}} \]

\[ r = \frac{214{,}046 - 25(93.04)(89.48)}{\sqrt{223{,}668 - 25(8656.44)}\,\sqrt{206{,}215 - 25(8006.67)}} \]

\[ r = \frac{5915.52}{6623.5968} = .893 \]




Like all coefficients of correlation, the one yielded by this method 
will approach +1.0 to the degree that the relationship between the
variables is close and positive, −1.0 to the degree that the relationship
is close and negative, and .00 to the degree that there is an
absence of any relationship between the two variables. The
rank-difference method can, of course, be applied to scores if they are
first ranked, but when the number of cases exceeds say 30, the 
ranking becomes time-consuming. For this and other reasons the 
rank-difference coefficient is seldom used for large numbers of cases. 
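
The rank-difference formula can be sketched in Python as follows. The two sets of ranks are hypothetical, chosen only to show the arithmetic, and are not the ranks of the ten pupils in the illustration above; the function name `rank_difference_rho` is illustrative. The sketch assumes that no ranks are tied.

```python
def rank_difference_rho(ranks_1, ranks_2):
    """Spearman rank-difference coefficient: 1 - 6*sum(D^2) / (N(N^2 - 1))."""
    n = len(ranks_1)
    sum_d2 = sum((a - b) ** 2 for a, b in zip(ranks_1, ranks_2))
    return 1 - 6 * sum_d2 / (n * (n ** 2 - 1))

# Hypothetical ranks of ten pupils on two tests (no ties).
test_1 = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
test_2 = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]

rho = rank_difference_rho(test_1, test_2)
print(f"rho = {rho:.3f}")

# Identical ranks give +1.0; exactly reversed ranks give -1.0.
print(rank_difference_rho(test_1, test_1))
print(rank_difference_rho(test_1, test_1[::-1]))
```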


Fig. 22.—Scatter diagram of 25 paired scores on
Forms A and B of the Purdue Reading Test. 


Scatter Diagrams—If the paired variables are in the form of 
scores instead of ranks, the direction of the relationship between the 
variables is most conveniently determined by plotting each pair of 
scores on a two-way table. A scatter diagram for the paired scores in 
Table 25 is shown in Fig. 22. The vertical axis is pointed off in 
regular steps for as many class intervals as are used in the frequency 
distribution of one of the variables, and the horizontal axis is similarly
pointed off for the other variable.

Each pair of scores is then plotted by entering the vertical axis at
the proper point for the magnitude of the score of one of the
variables and then moving across horizontally until the proper point
for the score on the other variable is reached; at this intersection a
mark is made to indicate the occurrence of one pair of scores. The
process is repeated until all the pairs of variables have been plotted.
The result is a scatter diagram that indicates graphically the closeness
and direction of the relationship.

Most two-way tables are arranged so that the horizontal axis 
indicates increasing scores to the right. For such tables a positive 
relationship is indicated when the scores tabulated tend to fall in a 
diagonal line from the lower left to the upper right corner of the 
table. A negative relationship is indicated when they fall in a 
diagonal from the upper left to the lower right corner. Absence of 
relationship is indicated to the degree that the tabulated pairs of 
scores tend to fall on neither of these diagonals but rather along a 
horizontal or vertical line or in a circle. These various degrees of 
relationship are shown in Fig. 23.

Sometimes the tabulated pairs of scores tend to fall in a curved 
line; while this phenomenon is rather infrequent in the relationships 
obtained with the evaluation devices discussed in this volume, teach- 
ers should be on the lookout for such curvilinear relationships. In 
particular it should be realized that special statistical methods different
from those here discussed are necessary for the determination
of the degree of these relationships.

Pearson Product-moment Coefficient of Correlation.—Although
the closeness and direction of the relations of scores not in the form
of ranks can be determined without the use of scatter diagrams of
the type described above, it is usually advisable to employ them as a
means of discovering whether the relationship is sufficiently rectilinear,
that is, non-curvilinear, to justify using the formula for the
Pearson product-moment correlation coefficient. If the direction of
the relationship as determined by inspection of the scores plotted in
a two-way table reveals no marked tendency toward the curvilinear,
the direction and closeness may be calculated by means of the
Pearson product-moment coefficient of correlation. This coefficient
(r) is fundamentally defined by the following formula, which
shows it to be the arithmetic mean of the products of the paired
z-scores:




\[ r = \frac{\sum z_x z_y}{N} \]

where r = Pearson product-moment coefficient of correlation
      z_x and z_y = paired z-scores on each of the variables
      N = number of pairs of scores


In order to use this formula, however, it is necessary to compute the 
z-score equivalents for each raw score, which in turn involves com- 
puting the arithmetic mean and standard deviation for the distribu- 
tion of each variable, determining the deviation of each raw score 
from its arithmetic mean, and dividing this difference by the stand- 
ard deviation. These quotients are then multiplied in pairs, the 
products added, and the sum divided by the number of pairs. The 
quotient thus obtained is the Pearson product-moment coefficient 
of correlation. 

Fortunately, many other equivalent formulas have been devised 
which circumvent this cumbersome procedure but yield the same
results. Through algebraic manipulation it can be shown that the 
coefficient defined by the above formula can also be obtained by 
means of the following formula which, although it looks far more 
difficult, is much easier to apply: 


\[ r = \frac{\sum XY - N M_x M_y}{\sqrt{\sum X^2 - N M_x^2}\,\sqrt{\sum Y^2 - N M_y^2}} \]

where X and Y = raw scores on each variable
      M_x and M_y = arithmetic means of the two distributions


The application of this formula is illustrated in terms of the 25
pairs of measures in Table 25. The reader should follow through
the computations in this table for an understanding of the procedure.
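
The two formulas can be compared on the 25 pairs of scores from Table 25. The sketch below is an illustration in Python rather than the authors' own worksheet; a few scores that are hard to read in this copy (73, 71, 71, and 74) were recovered from the columns of squares and products, and the function names are illustrative. The standard deviations are computed with N in the denominator, which the z-score definition requires if the two formulas are to agree.

```python
from math import sqrt

form_a = [62, 115, 117, 120, 84, 87, 80, 110, 93, 100, 89, 101, 103,
          63, 104, 74, 76, 115, 104, 71, 68, 95, 104, 86, 105]
form_b = [65, 106, 98, 125, 73, 78, 82, 90, 95, 96, 91, 97, 100,
          71, 94, 74, 81, 125, 103, 74, 66, 85, 92, 78, 98]

def r_from_raw_scores(x, y):
    """Raw-score formula: sums of X, Y, X^2, Y^2, and XY only."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)
    numerator = sum_xy - n * mx * my
    denominator = sqrt(sum_x2 - n * mx ** 2) * sqrt(sum_y2 - n * my ** 2)
    return numerator / denominator

def r_from_z_scores(x, y):
    """Defining formula: arithmetic mean of the products of paired z-scores."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sx = sqrt(sum((a - mx) ** 2 for a in x) / n)
    sy = sqrt(sum((b - my) ** 2 for b in y) / n)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / n

print(f"raw-score formula: r = {r_from_raw_scores(form_a, form_b):.3f}")
print(f"mean of z-products: r = {r_from_z_scores(form_a, form_b):.3f}")
```

Both routes give the same coefficient, which rounds to the .893 shown in Table 25.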

When the data have been grouped into class intervals in a scatter 
diagram, the large numbers involved in the direct use of raw scores 
can be avoided and the computation thereby simplified. Instead of 
each raw score, a number is used that indicates the number of class 
intervals by which the class interval containing the score deviates 
from the lowest class interval. It is unnecessary to multiply by the 
size of the class interval any of the figures used in obtaining the co- 
efficient of correlation from grouped data. 




STATISTICAL SIGNIFICANCE OF MEASURES OBTAINED 


Suppose a teacher has administered evaluation devices to two 
different groups of pupils, computed the arithmetic means of the 
distributions of the scores, and obtained the difference between the 
two arithmetic means. Only rarely will the two averages be the 
same; the average performance of one group will almost surely be 
higher than that of the other. Is the difference obtained due to 
chance fluctuations or does it reflect a real difference in the popula- 
tions of pupils of whom the two groups are samples? 

This question arises because the measures obtained with all meas- 
uring instruments are not perfectly reliable or valid and because 
statistical measures such as means or standard deviations are always 
based upon parts or samples of the total group of people or phe- 
nomena about which it is desired to generalize. A given group of 
pupils is only a small part of the total population of pupils of whom 
we should like to be able to say that an obtained difference holds 
true. Similarly, a single administration of an evaluation device is 
only a small sample of all the many administrations of it of which 
we take the single administration to be typical or representative. 

Whenever we seek to generalize about a whole on the basis of 
our knowledge of a part of it, we are involved in what is known 
as sampling error. This phenomenon is illustrated, traditionally and 
quite comprehensibly, in terms of a bucketful of spherical balls, a 
given proportion of which are white and the remainder black, and 
which differ only in color. Let us say there are 1000 balls in a bucket, 
half of them white and half black. If an experimenter were to 
blindfold himself and pick out one of the balls, the chances would 
be 50-50 that the one selected would be black and, similarly, 50-50 
that it would be white. If this ball were put back in the bucket, the 
balls thoroughly mixed up, and another selection made, the same 
chances for drawing a black or a white ball would prevail. If this 
were done an infinite number of times, the number of white balls 
chosen would equal the number of black ones chosen. But in only 
one hundred choices the proportion of black balls chosen will fre- 
quently differ from exactly one-half. Sometimes out of 100 choices 
as many as 60 will be black, while in other groups of 100 choices 
perhaps as few as 40 will be black. 




If we take a great many groups of 100 random choices from the
bucketful of balls, the numbers of black ones per 100 choices will
differ from one another and can be placed in a frequency
distribution. The arithmetic mean of this distribution of number of
black balls per 100 chosen will approach 50 as the number of groups
of 100 choices is increased, although for any single group of 100
choices the number may differ from 50. Thus the arithmetic mean in
the long run will be 50.
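
The bucket illustration can be tried out directly. The following Python sketch is offered only as an illustration and is not part of the original discussion; the function names, the fixed seed, and the figure of 5000 groups are assumptions chosen for the example.

```python
import random
from statistics import mean

def black_count(draws=100, p_black=0.5, rng=random):
    """Number of black balls in `draws` selections made with replacement."""
    return sum(rng.random() < p_black for _ in range(draws))

def simulate(groups, draws=100, p_black=0.5, seed=0):
    """Black-ball counts for `groups` independent groups of `draws` selections."""
    rng = random.Random(seed)  # fixed seed (an assumption) so the run is repeatable
    return [black_count(draws, p_black, rng) for _ in range(groups)]

counts = simulate(groups=5000)   # one count per group of 100 choices
grand_mean = mean(counts)        # settles near 50 as the number of groups grows
```

Any single entry of `counts` may differ from 50, while `grand_mean` lies close to it, which is the behavior the paragraph describes.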

The distribution of numbers of black balls per 100 choices will 
also have a standard deviation, which is computed by a special 
formula and is called the standard error of the proportion. If we 
knew at the beginning the proportion of black and white balls in 
the bucket, we could thus calculate in advance the standard devia- 
tion of the distribution of the proportion of black balls, and by 
using our knowledge of the area relationship under the normal 
curve we could predict the frequencies with which various propor- 
tions of black balls would be obtained in the long run. This is so 
because proportions, arithmetic means, standard deviations, and 
most other statistical measures are known to fall into a normal 
distribution when large numbers of them are calculated on the 
basis of samples containing say 30 or more members drawn at 
random. 
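
As a hedged illustration of the "special formula," the usual standard error of a proportion, the square root of p(1 - p)/N, can be applied to the bucket example. The sketch below assumes that formula and the normal approximation the text relies on; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def se_proportion(p, n):
    """Standard error of a proportion p observed in samples of size n."""
    return sqrt(p * (1 - p) / n)

se = se_proportion(0.5, 100)   # 0.05, that is, 5 balls per 100 choices

# Share of groups of 100 expected to show between 40 and 60 black balls,
# using the normal curve (an approximation, as in the text).
curve = NormalDist(mu=0.5, sigma=se)
share_40_to_60 = curve.cdf(0.60) - curve.cdf(0.40)
```

Under these assumptions 40 and 60 black balls lie two standard errors from the expected 50, so roughly 95 groups in 100 would be expected to fall between them.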

The analogy may now be shown between this illustration and the 
conditions under which teachers or experimenters endeavor to in- 
terpret the statistical measures and the differences between them 
which they obtain. If one wishes to generalize about all of a particular 
group of pupils, say all pupils taught with a certain textbook or all 
pupils of a given age or sex, on the basis of the limited numbers of 
pupils available for evaluation or experimentation, the entire group 
is comparable to the entire bucketful of black and white balls, and 
the available sample of pupils is comparable to the 100 balls chosen
at random in each group of selections.

Similarly, if a teacher wishes to generalize, on the basis of a single 
administration of the test, about the scores obtained by a group of 
pupils every time it is administered, or to draw conclusions con- 
cerning a pupil’s “true” score on a test on the basis of a single fallible 
administration or form of the test, the true score is comparable 
to the known proportions of black and white balls in the bucket,
and the single form or administration of the test is comparable
to a single group of 100 selections of balls. Thus, if another sample
of pupils belonging to the class about which one wishes to generalize 
were chosen, it would not be expected that the same values of the 
statistical measures, such as means, would be obtained. 

If many other similar samples of pupils from the same class were 
used, the statistical measures would differ from one another and 
could be placed in a frequency distribution which would itself have 
an arithmetic mean and a standard deviation. If it were possible to 
administer a given test to a single group of pupils a great number 
of times without introducing practice and other effects, the score 
obtained by a single pupil would be expected to fluctuate around 
his theoretical “true” score. The different scores obtained from suc- 
cessive administrations of the same test could also be placed in a 
frequency distribution whose mean and standard deviation could 
be computed, the mean being equal to the pupil's “true” score on 
the given test and the standard deviation being called the standard 
error of measurement or the standard error of the score. 
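
The idea of a standard error of measurement can be sketched by simulation. The example below is only an illustration: the true score of 60, the error spread of 4 points, the normally distributed errors, and the 10,000 administrations are all assumptions, since no real pupil can be retested this way without practice effects.

```python
import random
from statistics import mean, stdev

def repeated_scores(true_score, error_sd, administrations, seed=0):
    """Hypothetical obtained scores: the true score plus random error each time."""
    rng = random.Random(seed)
    return [rng.gauss(true_score, error_sd) for _ in range(administrations)]

scores = repeated_scores(true_score=60.0, error_sd=4.0, administrations=10_000)
estimated_true_score = mean(scores)   # mean of the distribution of obtained scores
estimated_sem = stdev(scores)         # its standard deviation: the standard error of measurement
```

The mean of the simulated scores recovers the assumed true score, and their standard deviation recovers the assumed standard error of measurement.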

Such standard errors (1) of measurement, (2) of means, (3) of 
standard deviations, and (4) of differences between means or stand- 
ard deviations are all to be considered as themselves standard devia- 
tions of distributions of scores, means, standard deviations, and 
differences, obtained from great numbers of random samples drawn 
from the same population. The usefulness of these standard errors 
depends on the fact that scores and statistical measures based on 
random samples are known to fall into a normal distribution if a 
sufficient number of samples are considered. If we know that the
form of distribution is normal and if we can compute its standard 
deviation, then by using the area relationships under the normal 
curve we can estimate the frequency with which various scores, 
arithmetic means, standard deviations, or differences between means 
or standard deviations will occur if we continue to draw samples 
beyond the single one upon which our knowledge is usually based. 

Knowing the standard deviation of a distribution of statistical
measures and that the form of the distribution may be considered
normal, we can proceed to estimate the probability that a given
statistical measure deviates so little from what would be expected
on the basis of chance that its deviation may be ascribed to
fluctuations due to random sampling. The probability that the
statistical measure obtained could not have arisen by chance may
then be considered the level of confidence with which we may accept
it as true of the whole population.

Thus, suppose we set up the hypothesis that there is no real 
difference between statistical measures, say arithmetic means, of a 
certain aspect of pupils such as intelligence in two distinct popula- 
tions, say boys and girls. Such hypotheses, namely, that the true dif- 
ference between two statistical measures based on complete popula- 
tions is zero, are called null hypotheses. Yet we usually obtain a 
difference greater or less than zero. If the null hypothesis is borne 
out, then this difference will not be greater or less than zero by an 
amount which could not be due merely to the fluctuations of random 
sampling. If the null hypothesis is disproved, the difference based 
on random samples drawn from the population must be considered 
to indicate a real difference greater or less than zero in the 
population. 

How can we test the null hypothesis? We can do this if we 
have (1) the obtained difference between the statistical measures 
based on random samples of the population and (2) the standard 
error, or the probable error, of the difference between the statistical 
measures, which is equivalent to the standard deviation of the dis- 
tribution of such differences obtained on the basis of a great number 
of similar random samples. The difference is readily computed by 
subtracting the smaller of the obtained statistical measures from the 
larger. The standard error, S.E., of the obtained difference must be 
computed by means of special formulas whose derivation is beyond 
the present discussion. The formulas for the standard errors of 
various statistical measures are given below. Each of these can be 
converted into the equivalent probable error, P.E., by multiplying 
the standard error by .6745. That is, 


$$\mathrm{P.E.}_{(\text{statistic})} = .6745 \; \mathrm{S.E.}_{(\text{statistic})}$$


$$\mathrm{S.E.}_{M} = \frac{\sigma}{\sqrt{N}}$$

$$\mathrm{S.E.}_{(M_1 - M_2)} = \sqrt{\mathrm{S.E.}_{M_1}^{2} + \mathrm{S.E.}_{M_2}^{2}}$$

[The remaining formulas of this list are illegible in the source.]


To test the null hypothesis with respect to an obtained statistical 
measure, whether this be an arithmetic mean, a median, a standard 
deviation, a semi-interquartile range, an obtained difference between 
arithmetic means or standard deviations, or a coefficient of correla- 
tion, we determine the number of standard errors above or below 
zero at which the measure would fall in a normal distribution. This
is done by forming a fraction, or critical ratio, whose numerator is 
the obtained difference and whose denominator is the standard error 
of the difference. Then by using the table of areas under the normal 
curve included between the mean and ordinates erected at various 
standard deviation distances along the range above and below zero, 
we can determine the probability that the obtained difference could 
have occurred in a population of differences whose mean is zero. If 
the difference is shown to fall at a point so many standard errors 
from the mean of zero that only five out of 100 or one out of 100 such
measures could have occurred through fluctuations in random
sampling, the measure obtained is said to be significant or very
significant respectively. The computations for two illustrative cases
are shown in Table 26.


TABLE 26 
Illustrative Problem I: Range within which the obtained mean may be 
expected to deviate from the true mean. 
Mean of a Class on an Attitude Scale 


M = 8.67     σ = 2.4     N = 36

$$\mathrm{S.E.}_{M} = \frac{\sigma}{\sqrt{N}} = \frac{2.4}{\sqrt{36}} = .4$$

$$\mathrm{P.E.}_{M} = .6745 \; \mathrm{S.E.}_{M} = .6745(.4) = .27$$


Therefore the chances are 50 out of 100 that the obtained mean
deviates from the true mean by not more than .27 score points.

The chances are 68 out of 100 that the obtained mean deviates from
the true mean by not more than .4 score points.

The chances are about perfect (99+ out of 100) that the obtained
mean does not deviate from the true mean by more than three S.E.M's,
or by more than 1.2 score points.


Illustrative Problem II: Significance of a difference between means 


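Scores on an Arithmetic Test

                          Class A              Class B
N                         36                   49
M                         75                   82
σ                         6                    14
S.E.M = σ/√N              6/√36 = 1            14/√49 = 2


Difference between means = M_B − M_A = 82 − 75 = 7

$$\mathrm{S.E.}_{(M_B - M_A)} = \sqrt{\mathrm{S.E.}_{M_A}^{2} + \mathrm{S.E.}_{M_B}^{2}} = \sqrt{1^{2} + 2^{2}} = \sqrt{5} = 2.236$$

$$\text{Critical ratio} = \frac{M_B - M_A}{\mathrm{S.E.}_{(M_B - M_A)}} = \frac{7}{2.236} = 3.13$$

[Figure: normal curve of differences between means centered at 0, with the base line marked at 2.236, 4.472, and 6.708 (one, two, and three standard errors) and the obtained difference of 7 indicated.]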


Interpretation: A critical ratio of 3.13 will occur less than once in one 
hundred times (see Table 24) in a distribution of differences between 
means when the true difference between means is zero. Consequently 
the obtained difference belongs not in this distribution but rather in a 
distribution of differences whose mean is greater than zero. The null 
hypothesis has thus been disproved. 
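
The arithmetic of the two illustrative problems can be rechecked with a short Python sketch. It is a minimal illustration under stated assumptions: the function names are hypothetical, the two class means are treated as uncorrelated, and the chance figure is the two-sided area under the normal curve beyond the critical ratio.

```python
from math import sqrt
from statistics import NormalDist

PE_FACTOR = 0.6745  # converts a standard error into a probable error

def se_mean(sd, n):
    """Standard error of an arithmetic mean."""
    return sd / sqrt(n)

def se_diff_means(se_a, se_b):
    """Standard error of a difference between two uncorrelated means."""
    return sqrt(se_a ** 2 + se_b ** 2)

def two_tailed_chance(critical_ratio):
    """Chance of a ratio at least this far from zero when the true difference is zero."""
    return 2 * (1 - NormalDist().cdf(abs(critical_ratio)))

# Problem I: mean of a class on an attitude scale (sigma = 2.4, N = 36)
se_attitude = se_mean(2.4, 36)
pe_attitude = PE_FACTOR * se_attitude

# Problem II: Class A (M = 75, sigma = 6, N = 36), Class B (M = 82, sigma = 14, N = 49)
se_a = se_mean(6, 36)
se_b = se_mean(14, 49)
se_difference = se_diff_means(se_a, se_b)
critical_ratio = (82 - 75) / se_difference
chance = two_tailed_chance(critical_ratio)
```

Rounded to two places, the sketch gives a standard error of .40, a probable error of .27, and a critical ratio of 3.13, with a chance well under one in one hundred.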


Derived Scores
The major function of the statistical techniques thus far described
is the interpretation of scores obtained by pupils on educational and 
psychological tests. Of these techniques the measures of central
tendency and of variability are most important in setting up the
norms in the light of which raw scores can be interpreted. A derived 
score is a raw score expressed in terms of norms. Norms are the 
levels of performance on a test attained by defined groups of pupils.
They are to be distinguished from standards in that they describe
what is rather than what should be. The distinction between norms 
and standards has already been discussed in Chapter XXI. 

Among the various types of norms with which a teacher may ex- 
pect to work in interpreting scores on standardized tests are (1) age 
norms, (2) grade norms, (3) percentile ranks, (4) standard scores, 
and (5) quotients. Each of these will be described briefly and their 
various advantages and disadvantages considered. 

Age Norms.—Age norms are obtained by giving a test to repre- 
sentative groups of pupils at various age levels and computing 
measures of central tendency of the distribution of the scores ob- 
tained in each age group. It can then be said that a given raw score 
indicates a level of performance typical of a certain chronological 
age. For mental ability tests the raw score is interpreted as a mental 
age, for reading tests as a reading age, and so on for whatever other 
aspects the test purports to evaluate. Thus, if the average score of all 
ten-year-old pupils on a test is 46, any pupil who scores 46 on that 
test will receive a derived score of ten years on it, regardless of his 
actual chronological age. By assuming that there is a regular increase 
in test score from one age level to the next, some authors of standard 
tests provide norms for intermediate age levels, say ten years and 
three months. The test score obtained by adding one-fourth of the 
difference between the eleventh- and tenth-year raw scores to the 
latter would be a raw score equivalent to ten years and three 
months. 
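
The interpolation just described can be sketched in Python. The ten-year norm of 46 comes from the example above; the eleven-year norm of 54 and the function name are hypothetical, introduced only to make the arithmetic concrete.

```python
def interpolated_norm(score_lower, score_upper, months, months_per_step=12):
    """Raw score equivalent to `months` past the lower age level.

    Assumes, as the text does, a regular increase in score between age levels.
    """
    if not 0 <= months <= months_per_step:
        raise ValueError("months must lie between the two age levels")
    fraction = months / months_per_step
    return score_lower + fraction * (score_upper - score_lower)

# Ten-year norm 46 (from the text); eleven-year norm 54 (hypothetical).
score_10y3m = interpolated_norm(46, 54, months=3)   # 46 + 1/4 of 8
```

With these figures the raw score equivalent to ten years and three months is 48.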

Age norms have the advantage of being readily understandable, 
especially for interpreting evaluations which are highly correlated 
with age, such as mental ability, reading ability, and various skills 
taught in the elementary grades. Their utility is limited, however, 
for other aspects of pupils, especially those related to the instructional 
objectives of secondary schools and other higher educational levels. 
Similarly, very high or very low scores are difficult to interpret in 
terms of age norms because the latter usually do not go beyond fairly 
intermediate raw score levels. In this case it is necessary to
“extrapolate,” which generally involves many perhaps unwarranted
assumptions.

A third disadvantage of age norms is their dependence upon 
administrative policies with respect to age at school entrance, retarda- 
tion and acceleration, and other factors determining the selection 
of the pupils upon whom the norms are based and conversely the 
interpretation of the scores of those being evaluated. If the sample 
used to establish norms for the ten-year level is composed of pupils 
who have made regular progress through school, the norms will be 
different from those based on ten-year-olds whose progress has been 
retarded. Unless the proportions of retarded, normal, and accelerated 
pupils of a given chronological age are well defined, the age norms 
based on them will lack real meaning. 

Grade Norms.—Grade norms are those which enable the in- 
terpretation of a pupil's raw score as equivalent to that achieved by 
typical pupils at a given grade level. These norms are obtained by 
giving a test to representative groups of pupils at various grade levels 
and computing measures of central tendency for the pupils in each 
grade. The average score of the group in a given grade can then 
be used in interpreting subsequent raw scores as equivalent to the 
test performance of that level. As for age norms, the validity of grade 
norms depends greatly upon the selection of the schools and pupils 
used in deriving them. Large numbers of schools in widely scattered 
areas whose promotion and retardation policies are fully considered 
and whose pupils’ mental ability is well defined must be used if 
grade norms are to be considered valid for interpreting scores on 
tests given throughout the nation. 

Norms for intermediate grade levels, say for the fifth month of the 
fifth grade, are usually obtained by a process of interpolation similar 
to that for age norms and based on the same assumption of uniform 
increase in raw score from one grade level to the next. 

Grade norms in elementary schools are dependent upon the 
relative emphases given to various subjects at different grade 
levels. If a subject is emphasized in one grade and neglected in 
another, the rate at which increases in raw scores on a test in that 
subject may be expected from one grade level to the next may be 
considerably affected, especially in so far as the applicability of the 
norms established in other school systems is concerned. 


Furthermore, grade norms are particularly liable to the danger of 
becoming standards to be attained by all pupils; the violence thus 
committed to the principle of individual differences among pupils 
in all aspects concerned with the achievement to be expected from 
them has been emphasized in this volume. A further disadvantage
of grade norms is their inapplicability to extreme scores obtained 
by pupils in the lowest grade and the highest grade for which 
norms are provided. Raw scores beyond these limits can usually be 
interpreted only vaguely. 

In the secondary school grades, grade norms are usually sup- 
planted by semesters-of-study norms, a given raw score being in- 
terpreted as the average for pupils who have studied a subject for a 
certain number of semesters. Here again the chronological age level 
of the pupils used in deriving the norms must be specified. 

Percentile Ranks.—Percentile ranks are norms by which a raw 
score can be interpreted as superior to the raw scores obtained 
by a certain percentage of pupils. Thus, if a raw score of 46 has a
percentile rank of 90, 90 per cent of the pupils received raw scores
lower than 46 and 10 per cent had raw scores of 46 or higher. The
method of computing percentile ranks has already been described. 
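
A minimal Python sketch of the definition given in this paragraph follows. It counts only scores strictly lower than the raw score, which may differ in detail from the method of computation described earlier in the book; the norm group of ten scores is hypothetical.

```python
def percentile_rank(raw_score, norm_scores):
    """Per cent of the norm group whose raw scores fall below `raw_score`."""
    below = sum(score < raw_score for score in norm_scores)
    return 100 * below / len(norm_scores)

# Hypothetical norm group of ten pupils.
norm_group = [30, 33, 35, 38, 40, 41, 42, 44, 45, 46]
rank_of_46 = percentile_rank(46, norm_group)
```

Nine of the ten hypothetical pupils score below 46, so the raw score of 46 receives a percentile rank of 90.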

Percentile rank norms have the advantage of easy interpretability. 
They depend just as completely, however, upon the description of the 
group of pupils used in obtaining them as do age and grade norms; 
the group must be fully described in terms of all aspects which are 
related to performance on the test. Thus it is usually necessary to 
specify the measures of central tendency and variability of the dis- 
tribution of chronological age, mental age, grade placement, 
semesters of study, and socio-economic status of the groups used in 
deriving norms for standardized achievement tests. 

Percentile ranks suffer from the unique disadvantage that they 
inevitably result in unequal units along the scale of performance on 
a test. Such units are undesirable for two reasons. First, a false
impression is given the unwary concerning the differences in
performance between scores, say, at the 50th and 55th percentile
levels and at the 90th and 95th levels. The latter, for all
bell-shaped distributions of raw scores, is always a larger actual
difference than the former, even though the differences in
percentile rank are equal.
Second, percentile ranks are not arithmetically manipulatable for
the computation of means and standard deviations. Thus, if two
raw scores are averaged and the percentile rank of the average is 
determined, it will differ from the mean obtained when the per- 
centile ranks of the two raw scores are first determined and then 
averaged. Another disadvantage is the variation in reliability of dif- 
ferent ranks. Despite these disadvantages, percentile ranks are so 
readily understood that they are used as norms for many standard 
tests. 
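
The inequality of percentile units can be made concrete for the special case of a normal distribution of raw scores. The sketch below is an illustration under that assumption; the function name is hypothetical, and the gaps are expressed in standard deviation units.

```python
from statistics import NormalDist

def gap_in_sd_units(lower_percentile, upper_percentile):
    """Distance, in standard deviations, between two percentile levels
    of a normally distributed set of scores."""
    curve = NormalDist()
    return curve.inv_cdf(upper_percentile / 100) - curve.inv_cdf(lower_percentile / 100)

gap_middle = gap_in_sd_units(50, 55)   # near the center of the distribution
gap_upper = gap_in_sd_units(90, 95)    # near the upper end
```

Under this assumption the five percentile points from 90 to 95 cover nearly three times as much of the raw-score scale as the five points from 50 to 55.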

Standard Scores.—Standard scores are based on the principle of 
the z-score and T-score already discussed. They avoid the
disadvantage of unequal units involved in percentile norms. Standard
score norms transform the distribution of raw scores into a dis- 
tribution with a specified measure of central tendency and vari- 
ability. Thus, T-scores always have an arithmetic mean of 50 and a 
standard deviation of 10; hence units of T-scores are always equal to
one-tenth of the standard deviation. 
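
As a hedged sketch, a set of raw scores can be given a mean of 50 and a standard deviation of 10 by a simple linear transformation of z-scores. This is one way of obtaining scores with those properties and is not necessarily the procedure described earlier in the book; the raw scores and function name are hypothetical.

```python
from statistics import mean, pstdev

def t_scores(raw_scores):
    """Linear T-scores: 50 plus 10 times each z-score.

    The standard deviation is computed over the whole group (dividing by N).
    """
    m = mean(raw_scores)
    sd = pstdev(raw_scores)
    return [50 + 10 * (x - m) / sd for x in raw_scores]

raw = [40, 46, 52, 58, 64]   # hypothetical raw scores
converted = t_scores(raw)
```

A raw score equal to the group mean (52 here) becomes a T-score of 50, and each T-score point corresponds to one-tenth of the raw-score standard deviation.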

Although not so readily understood by most test users, standard 
scores have the advantages (1) of providing equal units along all
parts of the scale of performance on a test and (2) of being based on 
the most reliable statistical measures obtainable from a frequency dis- 
tribution, namely, the measures of central tendency and of variability. 
However, there is still the necessity of specifying the nature of the 
group of pupils on whom the norms are based, and this renders 
standard scores liable to the danger already discussed in connection 
with age, grade, and percentile norms. When standard scores are 
interpreted in terms of percentile ranks, they always involve the 
assumption that the raw scores are distributed according to the 
normal curve. The degree to which this assumption is justified will 
determine the interpretability of standard scores in this way. 

Quotient Norms.—Quotients are frequently used in connection 
with age norms to obtain a measure of the relative status of a
pupil’s performance. The best known of the quotients, the
intelligence quotient or I.Q., is obtained by dividing the pupil’s
mental age by his chronological age and multiplying by 100. That is,

$$\mathrm{I.Q.} = 100 \times \frac{\mathrm{M.A.}}{\mathrm{C.A.}}$$

Since for the typical child mental age equals chronological age,
intelligence quotients over 100 indicate mental ability superior to
that of the average child and those below 100 indicate inferior
mental ability. The intelligence quotients obtained
with most group tests of mental ability have been in most cases 
equated to those obtained with the Stanford-Binet individual in- 
telligence test. 

In general, the aim has been to arrange the norms so that the 
intelligence quotient remains constant for a given pupil as he grows 
older. This goal has been only partially realized because of the fact 
that the standard deviation of the distribution of mental ages in- 
creases in general for groups of pupils of increasing chronological 
age. This results in increasing I.Q.’s for pupils with I.Q.’s greater
than 100 and in decreasing I.Q.’s for pupils with I.Q.’s less than 100.

Furthermore, the inapplicability of the mental age concept for 
chronological age levels greater than about fifteen years makes in- 
telligence quotients of doubtful meaning for older pupils and adults. 
Mental age does not increase with chronological age throughout an 
individual's lifetime; the curve of mental growth levels out some- 
where during early adulthood for most persons. Consequently, 
since mental age cannot keep pace with chronological age, the I.Q. 
should decrease as a person grows older unless some limit is placed 
upon chronological age. For the Stanford-Binet intelligence test, 
and consequently for most group intelligence tests, chronological 
age is held constant after an individual reaches the age of about 16. 
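
The effect of holding chronological age constant can be sketched as follows. The ceiling of exactly 16 years, the use of months as the unit, and the function name are assumptions made for illustration; the text says only "about 16."

```python
def intelligence_quotient(mental_age_months, chronological_age_months,
                          ca_ceiling_years=16):
    """I.Q. = 100 x M.A. / C.A., with C.A. held at a ceiling for older persons."""
    ceiling_months = ca_ceiling_years * 12
    effective_ca = min(chronological_age_months, ceiling_months)
    return 100 * mental_age_months / effective_ca

child_iq = intelligence_quotient(144, 120)        # M.A. 12 years, C.A. 10 years
adult_iq = intelligence_quotient(192, 30 * 12)    # M.A. 16 years, C.A. 30 years
```

The ten-year-old with a mental age of twelve receives an I.Q. of 120. The thirty-year-old with a mental age of sixteen receives 100, whereas dividing by the full thirty years would have given about 53.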

The educational quotient, E.Q., can be determined whenever an 
educational age or subject-matter age is known. The educational 
quotient then equals educational age divided by chronological age 
and multiplied by 100. Its advantages and limitations are similar to
those of the intelligence quotient, especially as far as the limitations 
of age norms discussed above are concerned. 

The accomplishment quotient, A.Q., represents an attempt to ob- 
tain a measure of whether a pupil is achieving instructional objectives 
to the limits of his mental ability. It is equal to educational age
divided by mental age and multiplied by 100. If the tests used to
determine educational age and mental age were not administered
at the same time, the educational quotient and intelligence quotient
may be substituted; the accomplishment quotient then equals

$$\frac{\mathrm{E.Q.}}{\mathrm{I.Q.}} \times 100.$$
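
The two routes to the accomplishment quotient can be compared in a brief Python sketch. The ages, given in months, and the function names are hypothetical; the point illustrated is only that the two formulas agree when both quotients rest on the same chronological age.

```python
def quotient(numerator_age, denominator_age):
    """General form of the quotients: 100 times one age divided by another."""
    return 100 * numerator_age / denominator_age

# Hypothetical pupil: C.A. 120 months, M.A. 132 months, E.A. 126 months.
ca, ma, ea = 120, 132, 126

iq = quotient(ma, ca)                 # intelligence quotient
eq = quotient(ea, ca)                 # educational quotient
aq_from_ages = quotient(ea, ma)       # accomplishment quotient from the ages
aq_from_quotients = quotient(eq, iq)  # accomplishment quotient from E.Q. and I.Q.
```

Here the I.Q. is 110, the E.Q. is 105, and both routes give an A.Q. of about 95.5.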


The initial promise of the accomplishment quotient has not been
realized for several reasons. (1) Not only must it inevitably be based
upon imperfectly reliable tests, but it accentuates the unreliability 
of the tests upon which it is based. (2) The age norms for the two 
tests used in determining the A.Q. are seldom derived from the 
same group of pupils and hence the norms are seldom strictly 
comparable. (See the discussion of comparable norms below.) 
(3) Because the curricula of most schools are geared to the pupils of 
average or below average mental ability, pupils of superior mental 
ability will usually be found to possess educational achievement 
below expectation, while pupils of below average ability will have
A.Q.’s above 100. Teachers can check this general finding by
dividing their pupils into low- and high-ability groups and
computing A.Q.’s for each group. Finally (4) it is now well known that
a battery of achievement tests and a measure of I.Q. are so closely
correlated (about .9) as to make it extremely dubious whether the 
two types of tests are not merely measures of each other. In other 
words, the “jangle fallacy” mentioned earlier is largely responsible 
for the concept of the A.Q. 

Comparable Norms.—It has frequently been found that pupils who 
receive a given derived score on one test do not receive the same 
derived score on other tests purporting to measure the same aspect. 
Thus for a given group of pupils the average I.Q. according to the 
norms of one test may be 105, according to those of another test 115, 
and according to those of a third 95. Similar discrepancies have been 
reported for tests of educational achievement—pupils are above 
average in achievement according to the norms of one test and 
below average according to those of another test. The major cause 
of this lack of comparability between the norms of different stand- 
ardized tests of the same aspect of pupils is probably the difference 
in the groups of pupils used in establishing the norms. Different test 
authors almost inevitably use different samples of the total pupil 
population in deriving norms for their tests, and the samples usually 
differ from one another in the degree to which they possess the 
achievement or ability evaluated by the test. Consequently, the 
sample used for standardizing one test may be higher or lower in 
ability or achievement than those used to obtain derived scores on 
different tests, although the tests claim to measure the same pupil 
aspect. 


Similarly, if guidance purposes make it desirable to ascertain 
how the different aspects of a pupil are related to one another, the 
norms for the tests of the different aspects must be comparable. Thus 
it may be desired to ascertain whether a pupil’s mathematical 
ability is superior to his verbal ability, or whether his achievement 
in chemistry is superior to that in French. Unless the norms for these 
various tests are known to be comparable, that is, have the same 
meaning with respect to level of ability or achievement, it is im- 
possible to draw conclusions concerning intertrait differences. 

Not only are similar samples of pupils required in the standardiza- 
tion of tests if the norms are to be comparable, but also allowances 
must be made for differences in reliability of the tests and in the 
shape of their frequency distributions.

These considerations and the fact that many investigators have
found comparability between test norms to be the exception rather 
than the rule for the great majority of present-day tests should lead 
users of standardized, commercially available evaluation devices to 
take certain precautions. 

In the first place, and perhaps most important of all, preference 
should be given to tests whose norms have been made comparable 
to those of other tests. Such tests have all been standardized on one 
group of pupils or else their norms have been equated to one
another through an anchor test.¹ Information relevant to this
consideration is usually contained in the manuals or literature of
the tests
that offer this great advantage in interpretability. In the second 
place, test users should not accept test norms at face value; rather 
they should attempt to secure and evaluate all available data con- 
cerning the sample of pupils used in deriving them. In the third 
place, when selecting standardized tests of a given aspect of pupils, 
preference should be given to the tests which have been standardized 
with large samples of pupils carefully selected for their
representativeness and which give evidence of having been submitted
to careful statistical analyses designed to yield reliable and
meaningful norms.

In the fourth place, many local factors should be taken into account
when interpreting the standing of pupils according to norms derived
on a nation-wide basis. Among these factors are (1) the legal age of
school entrance, (2) the average age of actual school entrance,
(3) promotion and retardation policies, (4) the rate and selectivity
of elimination from school, (5) the grade placement, time allow-
ances, and general nature of the curricula, (6) the efficiency of the
teaching personnel, (7) the composition of local pupils in terms of
mental ability and other aspects related to the one being evaluated,
(8) the relative emphasis in the local school situation on academic,
social, and vocational development, and similar factors. The meaning
of the derived scores for a given group of pupils may then be
interpreted in the light of these factors.

¹ An anchor test is one to whose norms the norms of other tests are equated.

It is obvious that in many practical situations it will be necessary 
to use norms whose applicability to local conditions is questionable 
or unknown. To the extent that these properties of evaluation
devices remain undetermined, the data obtained with them cannot
be interpreted meaningfully for the individual pupils with whom 
guidance is concerned. A way out of this difficulty may be found, 
however, if local norms are computed. These should be derived in 
terms of percentiles or standard scores rather than age or grade, 
unless the computation of standard age or standard grade norms 
is possible. It is to enable the computation of local norms that the 
statistical techniques described earlier have been presented. Need- 
less to say, such norms must also be obtained for teacher-made tests. 


SUMMARY 


Test scores can be made comparable by means of various derived 
scores such as z-scores, T-scores, or percentiles. The closeness and 
direction of relationships between paired variables can be ascertained 
by means of rank-difference or product-moment coefficients of 
correlation and by scatter diagrams. Determining the degree to which 
statistical measures obtained may be subject to the fluctuations of 
random sampling requires the computation of standard errors and 
levels of confidence as a means of testing null hypotheses. Derived 
scores may take the form of age, grade, or percentile norms, standard 
scores, or quotients. Norms in any form become meaningful only 
in so far as the sample of pupils used for deriving them is fully 
specified and described. Norms are most useful when comparable 
from one aspect of pupils to another. Achieving comparability re- 
quires either an anchor test or the use of the same pupils in es- 
tablishing the norms of all tests to be made comparable. 


QUESTIONS 


1. Below are the raw scores obtained by a class of 36 pupils on six tests 
of various aspects of the pupils. What interpretations could be made 
of these data for the purposes of guidance and for the determination, 
in terms of interrelationships, of the meanings of the test scores? 
Specify the statistical measures you would use and your reason for 
using each. Consider the applicability and worth-whileness of rank- 
ing, graphic representation, measures of central tendency, measures 
of variability, measures of relationship, and various kinds of derived 
scores. Select specific pupils and describe their guidance needs as 
revealed by their scores. 


        General      Mental     Adjustment   Score for     Score for   Score for
        Scholastic   Ability    Score on     Attitude      Socio-      Mechanical
        Achievement  Test       Self-        Toward        economic    Ability
Pupil   Score        Score      inventory    High School   Status      Test
A       4            63         53           3             49          36
B       [?]          82         63           7             7           42
C       [?]          90         35           2             29          56
D       7            101        3            6             15          96
E       35           67         98           4             18          38
F       47           72         2            1             19          49
G       12           153        64           9             45          57
H       136          173        85           11            39          16
I       66           95         58           4             32          78
J       54           78         34           4             5           19
K       71           88         13           7             7           44
L       108          131        95           9             1           84
M       114          134        18           11            27          82
N       121          147        17           6             3           50
O       25           48         11           1             14          83
P       158          165        [?]          9             44          40
Q       87           110        88           4             25          96
R       96           123        45           6             18          88
S       103          114        [?]          8             25          3
T       142          160        43           9             30          50
U       76           94         50           6             29          55
V       93           110        22           7             34          59
W       87           108        96           6             48          48
X       92           113        [?]          4             16          66
Y       79           [?]        78           8             19          68
Z       52           80         84           4             [?]         83
AA      96           118        36           6             41          16
BB      114          [?]        19           9             29          3
CC      102          119        [?]          6             39          [?]
DD      [?]          140        55           8             13          96
EE      83           100        53           3             19          [?]
FF      66           97         51           4             36          75
GG      113          129        37           9             [?]         3
HH      97           108        39           8             14          97
II      68           89         93           7             26          15
JJ      84           108        46           9             11          [?]

[?] marks an entry illegible in the source.


2. Design an experiment involving specific statistical measures to test
the hypothesis that arranging test items in order of difficulty has
no significant effect on test difficulty, reliability, or validity.

3. The belief has been expressed that pampered and overdependent
children are especially prone to fail in mathematical studies in school.
How would you determine experimentally the validity of this belief,
and what statistical measures would you employ?

4. What statistical measures would be involved in an experiment to
test the hypothesis that attitudes toward “out-groups,” or people
different in national and cultural background from oneself, tend to
become more heterogeneous in groups of persons as they increase in
age?

5. Do groups of pupils become more consistent with themselves as
they increase in maturity? What statistical techniques would you use
to answer this question?


REFERENCES 


Many books are available which cover more extensively, with fuller
illustrations and explanations, the use of statistical methods in educa-
tional measurement and evaluation. Among them are:


1. Garrett, H. E., Statistics in Psychology and Education, New York:
Longmans, Green, & Company, second edition, 1937.

2. Lindquist, E. F., A First Course in Statistics, Boston: Houghton
Mifflin Company, revised edition, 1942.

3. Peters, C. C., and Van Voorhis, W. R., Statistical Procedures and
Their Mathematical Bases, New York: McGraw-Hill Book Company,
Inc., 1940.


Appendix


PUBLISHERS


1. Alma Jordan Knauber
   3331 Arrow Ave.
   Cincinnati, Ohio

2. American Council on Education
   744 Jackson Place
   Washington, D.C.

3. Association Press
   347 Madison Ave.
   New York City

4. Bureau of Educational Measurements
   Kansas State Teachers College
   Emporia, Kansas

5. Bureau of Educational Research and Service
   State University of Iowa
   Iowa City, Iowa

6. Bureau of Publications
   Teachers College
   Columbia University
   New York City

7. California Test Bureau
   3636 Beverly Boulevard
   Los Angeles, California

8. Character Research Institute
   Washington University
   St. Louis, Missouri

9. C. H. Stoelting Co.
   Chicago, Illinois

10. Committee on Aptitude Test
    Association of American Medical Colleges
    Columbia Medical Building
    Washington, D.C.

11. Committee on Publications
    Harvard Graduate School of Education
    Cambridge, Massachusetts

12. Co-operative Test Service
    500 West 116th St.
    New York City

13. Division of Educational Reference
    Purdue University
    Lafayette, Indiana

14. Educational Test Bureau
    720 Washington Ave., S.E.
    Minneapolis, Minnesota

15. Eugene R. Smith
    Beaver Country Day School
    Chestnut Hill, Massachusetts

16. Evaluation in the Eight-Year Study
    Progressive Education Association
    6010 Dorchester Ave.
    Chicago, Illinois

17. Extension Department of the Training School
    Vineland, New Jersey

18. Harvard University Press
    Cambridge, Massachusetts

19. Houghton Mifflin Company
    2 Park St.
    Boston, Massachusetts

20. Keystone View Co.
    Meadville, Pennsylvania

21. Lafayette Printing Co.
    Lafayette, Indiana

22. Marietta Apparatus Co.
    Marietta, Ohio

23. Mechanical Engineering Dept.
    University of Minnesota
    Minneapolis, Minnesota

24. McKnight and McKnight
    109-111 West Market St.
    Bloomington, Illinois

25. Ohio College Association
    Ohio State University
    Columbus, Ohio

26. Psychological Corporation
    522 Fifth Avenue
    New York City

27. Psychological Institute
    Washington, D.C.

28. Public School Publishing Co.
    Bloomington, Illinois

29. RCA Manufacturing Co., Inc.
    Camden, New Jersey

30. Science Research Associates
    1700 Prairie Ave.
    Chicago, Illinois

31. Scott, Foresman and Co.
    623 S. Wabash Ave.
    Chicago, Illinois

32. Sheridan Supply Co.
    P. O. Box 1009
    Lincoln, Nebraska

33. Sven G. Nilsson
    16 Maverick Road
    Worcester, Massachusetts

34. Stanford University Press
    Stanford University, California

35. University of Chicago Press
    Chicago, Illinois

36. West Publishing Co.
    St. Paul, Minnesota

37. Winnetka Educational Press
    Horace Mann School
    Winnetka, Illinois

38. World Book Company
    Yonkers, New York



INDEX OF SUBJECTS 


Abilities, mental, see Mental Abili- 
ties. 

Ability to interpret data, Test 2.5, 
184-185 

Academic curriculum, intelligence of 
students in, 63 

Acceptance-rejection, 429, 432 

Accomplishment quotient, 287, 455, 
555-556 

ACH Index of Nutritional Status, 259
Achievement of instructional objec- 
tives, 17, 19-37; aptitude and, 65- 
66; evaluation of, 124-257; socio- 
economic status and, 109 
Acne, 264 
Adjustment, emotional and social, 17, 
70-83 
community environment and, 111- 
112 
dimensions of, 78-81, 339-341 
evaluation of, anecdotal behavior 
records in, 377-385; difficulties 
of, 337-346; importance of, 336— 
337; rating methods in, 361-375; 
techniques available, 346-357; 
tests of conduct, knowledge, and 
judgment in, 375-377; self-in- 
ventories in, 347-361 
further aspects of, 77-81 
manifestations of, 74-77 
nature of, 70-74 
socio-economic status and, 109-110 
Administrability, of intelligence test 
material, 288; of tests, 208-210 
Administration, of evaluating device,
122
of evaluation program, open-book 
examinations, 495-496; oral, 
497-503; rapport, 486-489; 
scheduling, 489-495; scoring, 
503-507 
Age norms, 551-552
Algebra tests, 305, 306 
Allport Ascendance-Submission Re- 
action Study, 357 
Allport-Vernon Study of Values,
413-415, 417, 420
Alternate forms method, 202-203 
Ambivalence, 429, 432 
Amenability of traits to evaluation, 
366 
American Association of School Ad- 
ministrators, 508 
American Child Health Association 
School Health Study, 47-50 
American Council on Education, 
377; Personality Rating Scale, 
373, 375; Psychological Examina- 
tion, 289, 296-297, 298, 420, 457 
American Educational Research As- 
sociation, 309 
American Home Scale, 438-441 
American Medical Association, 265, 
492, 508 
"Ammain" scale, 436 
Analogies, test items, 177-178; ad-
vantages and limitations, 177;
rules and suggestions for con-
structing, 178
verbal, test content, 295-296 
Anchor test, 557 
Anecdotal behavior records, 377-385
Application of Principles in Science: 
Test 1.5b, 181-184 
Appraisal, of educational instrumen-
talities, 5, 11-12; of school, 446-
447; of teacher, 450-485
Appreciations, relation to attitudes, 
89 
Aptitude, as achievement, 65-66; cul-
turally determined, 64; defined,
64-65
See also Special abilities.


A.Q., see Accomplishment quotient. 
Area relationships of the normal 
curve, 529 
Arithmetic, instructional objectives 
in, 32; tests of achievement in, 
304-306 
Arithmetic mean, 520-521, 522, 523;
standard error of, 547-550 
Around the World, 406, 407 
Art ability tests, 330-331 
Art instruction, product evaluation 
in, 214, 229-231 
Ascendance-submission, 80-81, 357 
Aspects, of pupils, 3, 17; of teachers, 
rated by pupils, 469-470 
Aspects of Personality Test, 350,
361
Attitude scales, see Scales, attitude.
Attitudes, 17
as emotions, 87
community environment and, 112
defined, 87-91
determiners of, 94-97
evaluation of, as educational out-
comes, 390-407; as interests,
407-421, 494, 495
ideals and, 90
importance of, 84-97
morale and, 89-90
morality and, 89-90
mores and, 89-90
movies and, 113
organization of, 91-94
significant for guidance, 97-100
socio-economic status and, 110
taste and, 89
toward behavior problems, 74-79
toward tests, 135-136
Audiometer, 269
Averages, 518-523; choice between,
523; uses of, 522-523
Averaging ratings, 363
Avocational interests, 99-100
Ayres Handwriting Scale, 225-227


Back, the, 43-44, 273-274 

Baldwin-Wood Weight-Height-Age
Tables, 259

BEC Personality Rating Schedule,
374, 375


Behavior, and attitudes, 398; prob- 
lems, see Adjustment. 

Behavior Description form, 382-384 

Bell Adjustment Inventory, 359-360,
361, 420

Bernreuter Personality Inventory, 
340, 356-359, 361, 420 

Bernreuter Self-Sufficiency Scale, 
357 

Betts-Keystone Telebinocular, 265, 
268, 303 


Betts Ready-to-Read Tests, 301, 302 

Bibliography of Mental Tests and 
Rating Scales, 307 

Bimodality, 528 

Birth order, 107-108 

Board of Examiners, University of 
Chicago, 8 

Breath, the, 272 

Breslich Algebra Survey Test, 305, 
306 

Brown Personality Inventory for
Children, 351, 361

Bryan-Yntema Rating Scale for the 
Evaluation of Student Reactions, 
476-481 

Business Education Council, 374,
375


California Tests of Mental Maturity, 
293-294, 297, 298 
Carriage, 41, 261-265 
Central tendency, measures of, 518- 
523; choice between, 523; uses of, 
522-523 
Changing alternatives, in rating
scales, 368
in test items, 151-152, 165-171;
advantages and limitations, 166-
168; cause from results, 166;
common principle or most in-
clusive, 166; most dissimilar,
166; result from causes, 166;
rules for composing, 168-171
Chapman Oral Intelligence Test,
497
Character, relation to attitudes, 91
Check List for Student Performance
in Dining-Room Waitress Service,
232-233


Check List of Student Reactions in 
Finding an Object Under the 
Microscope, 236-238 

Check list of visual behaviors, 265, 
267 

Check-list point-score method, 252- 
254 

Check lists, in product evaluation,
214; in short-answer tests, 164

Chest, 43, 273 

Chickenpox, 264-265, 280 

Child-to-child relationships, 107-108, 
432-434 

Clapp-Young Self-Marking Tests, 
505 

Class interval, 514 

Cleeton’s Vocational Interest Inven- 
tory, 412-413, 417 

Clerical ability tests, 328-329 

Clothing, 44, 274; in home econom- 
ics, 233 

“Cobblestone” theory of mental or- 
ganization, 58-59, 310-311 

Coefficients of correlation, 539-544 

College Entrance Examination 
Board, 7, 248, 252, 253, 254 

College of Science, Literature, and 
the Arts, University of Minne- 
sota, 8 

College success, prediction of, 8 

Color blindness, 268-269 

Commercial courses, intelligence of 
students in, 63 

Committee on Personnel Methods, 
377 

Common observation test material, 
292-293 

Communicable diseases, 44; symp- 
toms of, 277-280 

Community background and environ- 
ment, 111-113; evaluation of, 441- 
446 

Comparability in norms, 141-142, 
533-537, 556-558; Cooperative 
Test Service Tests, 299; Stanford 
Achievement Tests, 299 

Comparisons possible, teacher-made 
vs. standardized tests, 141-142 

Compass Diagnostic Tests in Arith- 
metic, 304-305 

Completion test items, 155-156 


Complex achievement, fitness of, for 
evaluation, 137-138 

Compositions, product evaluation of, 
233-236 

Conduct, tests of, 375-377 

Connections, quantity of, 58 

Constant alternatives, in rating 
scales, 368; in test items, 151, 156- 
165 

Construction, of essay test, 243-250; 
of evaluating devices, 122; of 
short-answer tests, 146-193; re- 
finement of, 139-141 

Cooperative Algebra Test, 305, 306 

Cooperative Effectiveness of Expres- 
sion Test, 302, 304 

Cooperative Intermediate Algebra 
Test, 305, 306 

Cooperative Mechanics of Expres-
sion Test, 302, 304

Cooperative Reading Comprehension 
Test, 302 

Cooperative Test Service, 299 

Correlation, 537-544 

Cost, of administering tests, 135; 
of standardized tests, 143, 209 

Courses of study, 22 

C-R Opinionaire, 406, 407 

Criteria for validity, of achievement 
evaluations, 197-200; of adjust- 
ment evaluations, 345-346; of at- 
titude evaluations, 397-404; of 
intelligence tests, 284-289 

Critical ratio, 549 

Cumulative frequency distribution, 
517-518, 536 

Curriculum, 1, 22; construction of,
27; pupils' attitudes and fitness
for, 86

Curvilinear relationship, 538, 542 

Cyanosis, 273 


Davis-Schrammel Elementary Eng- 
lish Test, 302, 304 

Deduction factor, 313 

Derived scores, 550-558 

Descriptive rating scales, 366-367 

Desires, see Motives. 

Detroit First-Grade Intelligence
Tests, 298


Detroit Mechanical Aptitudes Ex- 
amination, 326, 327 

Diagnosis, 9, 10 

Diagnostic Teacher-rating Scale,
472-476

Dictionary of Occupational Titles,
66-67, 314-316, 320

Difficulties in evaluating adjustment,
337-346; dimensions of adjust-
ment, 339-341; frank responses,
338; individualizing interpreta-
tion, 341-345; insightful responses,
338-339; validating through inde-
pendent criteria, 345-346

Difficulty, of essay questions, 250; of
intelligence tests, 287-288; of test
items, 189-190

Diphtheria, 277, 280

Direction of relationship, 538

Directions for administering tests,
adherence to, 502; in oral testing,
499-500; to administrator, 191-
192; to pupil, 191

Disarranged sentences test material,
295

Diseases, communicable, see Com-
municable diseases.

Disguised evaluation devices, ad-
justment inventories, 349; attitude
questionnaires, 399

Distorting effects of tests, 136

Dominance-submission, 429, 432

Double grading, of essay tests, 255;
of short-answer tests, 504

Drake Musical Memory Test, 330

Drawing, evaluation of, 229-231

Drawings, Mechanical, Rating Scale
for, 233

Dunlap Academic Preference Blank,
411-412, 417


Eames Eye Test, 265, 267-268 

Ears, the, 42, 269-271 

Economic activities, attitudes toward, 
99 

Economic conditions, influence on
pupils, 111

Eczema, 264

Editing items, 189

Educational age, 555

Educational attitude objects, 99

Educational quotient, 555 
Educational research, 12 
Eight-Year Study of the Progressive 
Education Association, 24, 28, 137, 
145, 180, 187, 382, 405 
Elementary school level, objectives 
at, 30-32 
Emotional adjustment, see Adjust- 
ment. 
Energy, general mental, theory of, 
59 
Engineering aptitude tests, 332 
English, instructional objectives in, 
33; product evaluation in, 214; 
specificity of achievement in, 20; 
usage tests, 302, 304 
Environment and background, 17
community, 111-113; evaluation
of, 441-446
home, 104-111
child-to-child relationships, 107-
108; evaluation of, 432-434
parent-child relationships, 105-
107; evaluation of, 428-432
parent-to-parent relationships,
104-105; evaluation of, 426-
428
socio-economic status, 108-111;
evaluation of, 434-441
intelligence and, 59-63, 285-286
school, 114; evaluation of, 446-
447
E.Q., see Educational quotient. 
Equal-appearing intervals, method 
of, 220, 223, 225, 391-394 
Equivalent forms method, 202-203 
Essay tests, construction, 243-250; 
grading, 250-256; product evalua- 
tion and, 238; short-answer tests 
vs., 127-138; types of questions in, 
244-247; when desirable, 242-243 
Evaluation, 3; comprehensiveness of,
3-4, 5; continuity of, 3-4; in the
Eight-Year Study, 180, 187, 405;
major steps in, 121-123; measure-
ment and, distinction between, 29-
30; philosophical basis of, 4-5;
purposes of, 1-14; staff, 137, 193
Evaluation devices, achievement, 
122; questions in, 124-125; types 
of, 126 


administration of, 122, 486-509
evaluation of, 123 
purpose of, 121-122 
relation to instructional objectives, 
125 

Evaluation program, administering 
the, 486-509 

Examiners Check List for Use in 
Noting and Interpreting Behavior 
During the Test Period, 290 

Experimental designs, 60-61 

Eyes, the, 41-42, 265-269 


Factorial analysis, 58, 64, 291, 313- 
314, 320-322, 340-341, 357, 413, 
414, 417, 438 

Family, attitudes toward, 99 

See also Home background and 
environment. 

Family-life variable, 431 

Family relationships, and intelli-
gence, 61; personal, see Home
background and environment.

Fascism, attitudes toward, 399 

Features, analysis of product into, 
216-218; amenability to evalua- 
tion, 216-217; grouping of, 217; 
provision for scoring, 218-220; 
weighting of, 218 

Feeble-mindedness, 60 

Feet, the, 44, 274 

Final examinations, exemptions from,
490-491

Food products and procedures, 231-
233

Foster homes, effect on intelligence, 
61, 62 

Frankness, 338 

Freeman Chart for Diagnosing
Faults in Handwriting, 227

French, specificity of achievement in, 
20 

Frequency, distributions, 512-517;
polygon, 517-518

“Fulcra” of conflict, 343-345


“g,” see General factor.
Garretson-Symonds Interest Ques-
tionnaire for High School Stu-
dents, 412, 417


General factor, “g,” 59, 62 

General mental abilities, 17, 494; 
evaluation of, 282-309 

See also Mental abilities. 

General Test for Stenographers and 
Typists, 329 

Generality vs. specificity, of abili-
ties, 57-59; of achievement, 20-22;
of attitudes, 91-94; of conduct
tests, 376-377

Generalized attitude scales, see 
Master attitude scales. 

Geometry tests, 305, 306 

German measles, 264, 280 

Gestalt psychology, 29; conception of 
intelligence, 57 

Glands, 273 

Goiter, 273 

Government, attitudes relating to, 99 

Grade norms, 552-553 

Grading, reliability of, 128-129 

Grading essay tests, 250-256; anony- 
mous, 255; check-list point-score 
method, 252-254; percentage-pass- 
ing method, 250-251; quality scale 
method, 251; sorting or rating 
method, 251-252 

Graphic rating scales, 366-370; rules 
for constructing, 368-370 

Graphic representation of frequency 
distributions, 517-518 

“Gravel” theory of mental organiza- 
tion, 58, 310-311 

Gray Standardized Oral Reading 
Check Tests, 302, 303 

Group intelligence tests, 289 

Grouping test scores, 512-517

Growth, 41; evaluation of, 259-261 

“Guess-Who” test, 370 

Guessing, 130; corrections for, 130, 
163-164, 170; instructions concern- 
ing, 162-163 

Guidance, 1-14; attitudes significant 
for, 97-100 


Haggerty-Olson-Wickman Behavior
Rating Schedule, 371-372, 373,
375

Hair, the, 265 

“Halo” effect, 362-363, 366


Handwriting, effect on grading, 129; 
product evaluation of, 214, 225- 
226 

Harper Social Study, 394 

Harvard College, 135 

Harvard-Newton Composition Scales, 
234 

Hay fever, 271 

Health, and socio-economic status,
109; knowledge, of teachers, 49-
50; needs, recognition of, by teach-
ers, 47
Hearing, 42; evaluation of, 269-271 
Heart, the, 273 


Height, measurement of, 260-261 

Henmon-Nelson Test of Mental 
Ability, 289, 295, 298 

"Heredity" as determiner of mental 
abilities, 59-63 

Hickey Checklist for Fitted Facings 
on Garments, 233 

Higher mental processes, amenabil- 
ity to improvement through teach- 
ing, 26-27, 187; evaluation de- 
vices for, 180-187 

Hillegas Scale for Compositions, 233, 
234, 236; Nassau County Supple- 
ment, 235 

Histogram, 517-518 

Holmgren color-blindness test, 269 

Home background and environment, 
104-111; evaluation of, 426-441 

Home economics, product evaluation 
in, 214 

Honesty, tests of, 376 

How I Teach inventory, 461-465,
[?]8

Hudelson Typical Composition Scale,
[?]


I.E.R. Assembly Tests for Girls, 327,
328

Illustrative test material, 291-297 

Impetigo, 263 

Income as index of socio-economic 
status, 436 

Indiana State Board of Health, 281 

Individual differences, 1, 6, 65 

Individualizing interpretation, 341- 
345 


Induction factor, 313 

Industrial arts, product evaluation 
in, 213, 233 

Information concerning tests, sources 
of, 306-307 

Information tests as measures, of 
intelligence, 294-295; of interests, 
410 

Insight, 338-339 

Instincts, relation to attitudes, 89 

Instructional effects of tests, 5, 10- 
11, 132-133 

Instructional objectives, 17, 19-37, 
125, 139 

Instrumentalities of education, ap- 
praisal of, 5, 11-12 

Intelligence quotient, 63, 554-555; 
predictive value, 63; requirements 
in occupations, 63 

Intelligence tests, criteria for con- 
tent of, 284-289; individual, 289- 
290; material included in, 291- 
298; non-language, 291; purpose 
of, 283-284; types of, 289-291; vs. 
general achievement tests, 286; 
when of most value, 287 

Interest Questionnaire, 349 

Interest-Values Inventory, 415, 417 

Interests, avocational or recrea-
tional, 99-100; educational-voca-
tional, evaluation of, 99, 407-421; 
relation to ability, 408; relation to 
attitudes, 88 

International Test Scoring Machine,
505

Interpretation, of evaluative data, 
122-123, 210; of individual intel- 
ligence tests, 289-290; of reliabil- 
ity coefficients, 206-208; of test 
scores, 510-560; possible, teacher- 
made vs. standardized tests, 141- 
142 

Interpretation of Data, Test 2.5, 184- 
185 

Interview technique, 427-430 

Intra-group rating devices, 370-371 

Intra-individual differences in spe-
cial abilities, 65

Introversion-extroversion, 80-81 

Iowa Algebra Aptitude Test, 305,
306


Iowa Every-Pupil Tests of Basic 
Skills, 302, 304 

Iowa Placement Examinations,
Mathematics Training, 305, 306;
Aptitude, 306

Iowa Plane Geometry Aptitude Test,
305, 306
Iowa Silent Reading Test, 302, 303
I.Q., see Intelligence quotient.
Ishihara test for color blindness, 269 


“Jangle” fallacy, 286, 556 

Job families, 315-320 

Joint Committee on Health Problems 
in Education, 281 


Kelley-Perkins How I Teach inven- 
tory, 461-465 

Kent-Shakow Form Board, 325-326 

Kerr-Remmers American Home 
Scale, 438-441 

Knauber Art Ability Test, 331 

Know Your School Series, 445-446 

Kuder Preference Record, 415-416,
417

Kuder-Richardson formula, 187, 204- 
205, 416, 439 

Kuhlmann-Anderson Intelligence 
Tests, 298 

Kwalwasser-Dykema Tests, 330 


Labor, of test construction, 134; of 
test scoring, 134 

Laboratory products and procedures, 
evaluation of, 236-238 

Laird C2 Introversion Test, 357

Language products, evaluation de- 
vices, 126-127 

Languages, foreign, intelligence of 
students in, 63 

Law aptitude test, 332 

Law of Effect, 9 

Learning, motivation of, 5, 9, 130-
132

Lee-Clark Reading Readiness Test,
301, 302

Legs, the, 44, 274 

Lettering, evaluation of, 227-229 

Level, of aspiration, 98-99; of con- 
fidence, 548 


Lewerenz Test in Fundamental Abil- 
ities of Visual Art, 331 

Liberalism-conservatism, 99 

Lice in hair, 265 

Linguistic (L-) score, 297 

Listening habits, 270-271 

"Logical" error in rating, 365 

Loofbourow-Keys Personal Index,
35[?], 361

Lymph glands, 275 


Maladjustment, see Adjustment.
Maller Case Inventory, 351-355, 361
Man-to-man rating scale, 367n.
Manual dexterity tests, 327-328
Manuals for tests, 200
Master attitude scales, 392-394; com-
pared with attitude questionnaires,
396-397; validity of, 397-404
Master list items, 164-165
Matching items, 171; advantages
and limitations, 173-174; construc-
tion of, 174-176
Mathematical abilities, 298-299;
304-306
Mathematics, instructional objec-
tives, 33-34; product evaluation
in, 214
McAdory Art Test, 330-331
Mean, arithmetic, see Arithmetic
mean.
Measles, 264, 277
Measurement, and evaluation, dis-
tinction between, 29-30; standard
error of, 547
Mechanical ability tests, 325-327
Mechanical Drawings, Rating Scale
for, 233
Median, 491, 521-525
Medical Aptitude Test, 332
Medical inspection of pupils, [?]
Meier-Seashore Art Judgment Test,
330, 331
Memory factor, 313 
Mental abilities, 17, 54-69
evaluation, of general, 282-298; of
mathematical, 304-306; of spe-
cial, 310-335; of verbal, 298-
304; organization of, 57-59
Mental age, 551, 554-555


Mental Measurements Yearbook, 
200, 211, 307, 333 

Mental processes, 1, 55, 243 

Mental speed factor, 313 

Metropolitan Achievement Test, 411 

Metropolitan Reading Readiness
Test, 301, 302

Microscope, Check List of Student 
Reactions in Finding an Object 
Under the, 236-238 

Minnesota Clerical Test, 312 

Minnesota English Scales, 234-235 

Minnesota Home Status Index, 437, 
438 

Minnesota Mechanical Assembly
Test, 325, 326

Minnesota Paper Form Board, 325,
326

Minnesota Rate of Manipulation
Test, 327

Minnesota Score Card for Meat
Roast, 220

Minnesota Spatial Relations Test,
325, 326

Minnesota Vocational Test for Cler-
ical Workers, 328, 329

Moss-Hunt Aptitude Test for Nurs-
ing, 332-333

Motivation of learning, 5, 9, 130-
132

Motives, 71-73; relation to attitudes,
88-89

Mouth, the, 271-272

Movies and attitudes, 113

Mumps, 280

Murdoch Analytic Sewing Scale for
Measuring Separate Stitches, 233

Music ability tests, 329-330


Nassau County Supplement to the
Hillegas Scale, 235

Nation, 396 

National Committee on Teacher Ex- 
aminations of the American Coun- 
cil on Education, 457-461 

National Education Association, 446, 
449 

National Intelligence Test, 497 

National Society for the Prevention 
of Blindness, 265 

National Society for the Study of
Education, 61-62, 69; Thirty-
Ninth Yearbook, 62; Twenty-Sev-
enth Yearbook, 61

National Survey of Secondary Edu-
cation, 63

National Teacher Examinations,
457-461


Nature of Proof, Test 5.21, 185-187 
Nebraska Personality Inventory, 340, 
360, 361 
Neck, the, 43, 273 
Negative suggestion, 132 
New Republic, 396 
New York State, 42, 43, 404, 405 
Noegenetic laws, 55-56 
Non-language achievement evalua- 
tion devices, 126-127 
Non-language intelligence tests, 291 
Normal curve, 526-531 
Norms vs. standards, 551 
See also Interpretation. 
Nose, the, 42, 271 
Null hypotheses, 548-550
Number factor, 313
Number series test material, 296-297
Numerical rating scales, 366-367 
Nurse, relationship to teacher, 48 
Nursery schools, effect on intelli- 
gence, 61, 62 
Nursing aptitude test, 332-333 
Nutritional Status Indices, 259 


Objectives, see Instructional objec- 
tives. 
Occupational information, 66-67 
Occupational Information and Guid- 
ance Service, 66 
Occupations, classifications of, 315- 
320; by socio-economic status, 
434-436 
nature of American, 314-323 
O'Connor Finger Dexterity Test, 327, 
328 
O’Connor Tweezer Dexterity Test, 
327, 328 
Ogive, 517-518 
Ohio State University, 21, 27, 289, 
295-296, 298, 457, 506 
Open-book examinations, 495-496 
Operationism in defining intelligence, 
57 


Opportunities in relation to guid-
ance, 66

Optional questions, avoidance of, 
248, 249 

Oral administration of tests, 497- 
500 


Ordinal position in family, 107-108 

Organismic view, 29 

Organization, of attitudes, 91-94; of 
mental abilities, 57-59 

Orientation Test, 406, 407 

Orleans Algebra Prognosis Test, 305, 
306 

Orleans Geometry Prognosis Test, 
305, 306 

O'Rourke Clerical Aptitude Test,
328-329

O'Rourke Mechanical Aptitude Test,
326, 327

Otis Quick-Scoring Mental Ability
Tests, 298

Otis Score Card, 218

Overprotection of child by parent,
106, 429, 432


Paired comparisons, method of, 221- 
222, 225 

Parent-child relationships, 105-107, 
428-432 

Parent-to-parent relationships, 104-
105, 426-428

Pearson product-moment coefficient 
of correlation, 542-544 

Percentage-passing method, 250-251 

Percentiles, 535-537, 553-554 


Perception of small differences,
tables for scaling, 229

Perceptual factor, 313

Performance evaluation devices,
126-127, 291

Personal values, 98, 413-415, 417, 
420 

Personality, 3; attitudes significant 
for guidance, 98; dimensions of, 
78-81, 339-341 

Physical aspects, 17, 38-53; evalua- 
tion of, 258-281 

Physical defects, frequencies, 45 

Pigeon breast, 273 

Pintner-Cunningham Primary Tests, 
292, 293, 298 


Pleasure to teacher in testing, 136 

Posture, 41, 261-263 

Pressey English Tests for Grades 
5 to 8, 302, 304 

Pre-testing, of essay questions, 248; 
of pupils, 491-492 

Primary abilities, 58, 63, 64, 291, 
313 

Probable error, 529-530; of statis- 
tical measures, 548 

Product and procedure evaluation, 
construction of devices for, 214- 
224; need for, 213-214; specific 
devices for, 224-241 

Product-moment coefficient of cor- 
relation, 542-544; standard error 
of, 549 

Professional aptitude tests, 331-333 

Professional growth in testing, 136 

Profile of mental ability, 64 

Progressive education, 5 

Progressive Education Association, 
24, 28, 145, 180, 382, 405 

Progressive Language Tests, 302, 
304 

Pryor Width-Weight Tables, 259

Psychophysical methods, 220-224 

Publishers, list of, 561-562 

Pupil achievement, factors influenc- 
ing, 454 

Pupil changes, teacher evaluation 
based on, 452-455 

Pupils, aspects of, 17 

Purdue Placement Test in English, 
302, 304 

Purdue Rating Scale for Instructors, 
468, 481 

Purdue Reading Test, 302, 303-304 


Quantitative (Q-) score, 297 
Quality scale method, 251 


Quartile deviation, 391, 524-525;
standard error of, 549

Questionnaires, attitude, 394-396;
available, 405-407; compared with
attitude scales, 396-397; validity
of, 397-404

Quotient norms, 554-556 


r, see Correlation.
Radio and adjustment, 113 


Range of test scores, 514, 524 

Rank correlation, 539-541 

Rank order, method of, 220, 222-225 
Rapport, 348-349, 399, 427 

Rating devices, types of, 365-368 
Rating methods, adjustment evalua-
tion, 361-375; reliability, and
number of levels, 219; usefulness
of, 374-375

Rating Scale for Mechanical Draw- 
ings, 233 

Rational equivalence, method of, 
204-205 


Raw score, 510; limits of a single,
514-515; meaninglessness of a sin-
gle, 510-511; tabulation of, 516-
517

Reading, age, 551; instructional ob- 
jectives in, 31-32; readiness, 299; 
tests, 299 

Rearrangement test items, 178-180;
advantages and limitations, 179;
scoring, 179-180

Reavis-Breslich Diagnostic Tests in
the Fundamental Operations of
Arithmetic and in Problem Solv-
ing, 305

Recording grades, 11 

Recreation, and adjustment, 113; at- 
titudes related to, 99 

Refinement of construction, 139-141 

Regents’ Inquiry into the Character 
and Cost of Public Education in 
the State of New York, 404, 405 

Rejection, parental, 106 

Relationship, closeness of, 538; meas- 
ures of, 537-544 

Reliability of evaluation devices, al-
ternative forms method, 202-203;
defined, 201; factors affecting,
205-206; interpretation, 206-208;
methods of estimating, 202-205;
number of levels and, 219; ra-
tional equivalence, method of, 204-
205; scoring, 128-129; split-test
method, 203-204

Religion, attitudes related to, 100

Reports and Records Committee, 382 

Research, educational, 12 

Review of Educational Research, 307 

Rho, 539-540 


Ringworm, 263 

Rinsland-Beck Natural Test of Eng- 
lish Usage, 302-304 

Rochester Athenaeum and Mechanics 
Institute, 232, 358 

Rogers Test of Personality Adjust- 
ment, 345, 354-356, 359, 361 


“s,” see Specific factors.

Sampling, of achievement, 129-130; 
of test items, 203 

Sampling error, 545 

"Sand" theory of mental organiza- 
tion, 57-58, 310-311 

Scabies, 263 

Scale for Measuring Attitude 
Toward Any Institution, 400-401 

Scale for the Measurement of At- 
titude Toward the Church, 400 

Scale for the Merit of Drawings 
by Pupils 8 to 15 Years Old, 229- 
230 

Scale of Beliefs: Tests 4.21 and 4.31, 
405-406; Tests 4.4 and 4.5, 406 

Scales, attitude, 391-394; compared 
with attitude questionnaires, 396- 
397; validity of, 397-404 

Scarlet fever, 264, 277 

Scatter diagrams, 541-542, 543 

Scholastic aptitude, 283-284 

School background and environment, 
114; evaluation of, 446-448 

School Health Study, 47-50 

Schooling-Clark-Potter Arithmetic 
Test, 305 

Science, specificity of achievement 
in, 20 

Science instruction, product evalua- 
tion in, 214 

Science Research Associates, 67 

Score Card for Judging Garments 
of 4-H Clothing Project I, 233 

Score cards, 214 

Scoring, arranging items for easy,
190; graphic rating scales, 369;
Guess-Who test, 370; instructional
values of, 133; Kuder Preference
Record, 415; O.S.U. Psychologi-
cal Test, 296; procedure in, 503-
507; rearrangement tests, 179; re-
liability of, 128-129


Scoring formulas to correct for 
guessing, 163-164 

Scoring key, preparation of, 191 

Scrambled sequence, arrangement of 
test items, 295 

Seashore Measures of Musical Tal- 
ent, 329, 330 

Secondary education, extent of, 8; 
instructional objectives in, 32-34 

Selection, of evaluating device, 122; 
of students, 5, 7-9 

Self-evaluation of teacher, 451 

Self-inventories, 347-361; usefulness 
of, 360-362 

Semesters-of-study norms, 553 

Semi-interquartile range, 524-525 

Short-answer tests, arranging, 187-
191; for scoring, 190; order of
difficulty, 189-190
construction of, 146-193
essay tests vs., 127-138

Sigma, 525-526 

Significance, statistical, 545-550 

Simple-recall test items, 152-155 

Sims Score Card for Socio-economic 
Status, 437-438 

Skewness, 528 

Skin and hair, the, 41, 263-265 

Smallpox, 265, 280 

Snellen chart, 265-267, 268 

Social adjustment, see Adjustment. 

Social Attitude Scales, 406, 407 

Social control, 6 

Social distance, relation to attitudes, 
90-91, 370 

Social sensitivity, 99 

Social studies, attitudes important in, 
99; product evaluation in, 214 

Socio-economic status, 108-111; eval- 
uation, 434-441 

Sociometric test, 370-371 

Sorting or rating method, 251-252 

Soviet Russia, attitudes in, 112-113 

Space factor, 313 

Spatial analogies test material, 296 

Spearman-Brown formula, 204-206, 
420, 466, 475

Spearman rank-difference coefficient 
of correlation, 539-541

Special ability evaluation, theoretical 
bases of, 310-313 




Special abilities, 17, 64-67; distin- 
guished from other aspects of 
pupils, 311-312; evaluation of, 
available tests for, 325-333; gen- 
eral function of evaluation, 323- 
325; theoretical bases of evalua- 
tion, 310-313 

Specific determiners, 161 

Specific factors, 59 

Specificity, of abilities, 57-59; of
achievement, 20-22; of attitudes, 
91-94; of conduct tests, 376-377 

Speech defects, 274

Stability of aptitudes, 65 

Standard deviation, 525-526; inter- 
preting the, 526-529; standard 
error of, 547-550 

Standard error, 546-550; of a pro- 
portion, 546; of differences be- 
tween means, 547-550; of differ- 
ences between standard deviations, 
547-550; of quartile deviation, 
549; of score, 547; of z, 549

Standard scores, 534-535, 554 

Standardized tests, choosing, 194-212; 
teacher-made tests vs., 138-143

Standards, maintenance of, 5-7; vs.
norms, 250, 551 

Stanford Achievement Tests, 299,
302, 411, 455, 467

Stanford-Binet Intelligence Test, 208,
289

State University of Iowa, 62

Statistical significance of measures
obtained, 545-550

Statistics, 12; interpretation of test
scores, 510-560

Stenquist Mechanical Aptitude Test,
326, 327

Stereotypes, 399

Stiebling-Worcester Chart for Diag-
nosing Defects in Buttonholes, 233

Stoddard-Ferson Law Aptitude Ex-
amination, 332

Strip keys, 504-505

Strong’s Vocational Interest Blank,
413, 416-421

Studiousness Questionnaire, 349

Study of Values, 413-415

Subject-matter fields, 1

Suggestion effects, 132


Swarthmore College, 51 
Symposium on intelligence, 56, 69 


Table of specifications, 146-148

Teacher, attitudes of, toward be- 
havior problems, 74-75; benefits 
to, in testing, 136; evaluation of, 
11-12, 450-485; health knowledge 
of, 49-50; mental abilities, role 
in evaluating, 282; personality of, 
rapport and, 488-489; physical 
aspects, role in evaluating, 48, 
258-259; rating scales, 467-468;
recognition of health needs by, 47 

Teacher-made tests, standardized 
tests vs., 138-143 

Teaching, guidance of, 5, 9-10 

Teaching methods, appraisal of, 11- 
12 

Teeth, the, 42-43, 271-272 

Telebinocular, 265, 268, 303 

Ten-item City Yard-stick, 441, 442-444

Tentative Checklist for Determining 
Attitudes on Fifty Crucial Social, 
Economic, and Political Problems, 


Terman-McNemar Test of Mental 
Ability, 289, 294-295, 298 
Test items, analysis, 288 
assembly into parts, 188 
composing, 148-150 
difficulty, arranging tests in order 
of, 189-190; determination of, 
287-288; of intelligence tests, 
287-288 
types of, 150-180; recall, 151-156;
recognition, 150-152, 156-180;
true-false, 151, 156-165 
Test-retest method, 202
Test scores, ranking, 512; relative 
nature of interpretation of, 511
Textbooks, appraisal of, 11-12 
Thermometer, clinical, 274-276 
Thirty-ninth Yearbook of the Na- 
tional Society for the Study of 
Education, 62 
Thorndike Handwriting Scale, 219, 
225, 227, 233 
Thorndike process, for scaling
percentages of preference, 229-230,
231, 235

Throat, the, 43, 272-273

Thurstone Personality Schedule, 357

Thurstone Scales for the Measure- 
ment of Attitude, 392-393, 394 

Tics, 274 

Tonsils, the, 272-273, 280 

Torgerson Teacher Rating Scale, 467 

Training raters, 363-365 

Trait, 312 

T-scores, 535, 554 

Twenty-seventh Yearbook of the 
National Society for the Study of 
Education, 6% 

Twins, identical, 61 

Two-factor theory, 59 


U. S. Civil Service Commission, 329 
U. S. Employment Service, 314-321 
U. S. Office of Education, 446 
University of California, 43 
University of Chicago, 496; Board 
of Examiners, 8 
University of Minnesota, 42, 43, 515 
College of Science, Literature, and 
the Arts, 8; General College, 8-9 
Uses of tests and measurements, 5 


Validity, 195-201; of achievement
evaluation, 197-200; of adjustment 
evaluation, 345-346; of attitude 
evaluations, 397-404; of
intelligence tests, 287-288

Values, personal, 98, 413-415 

Van Wagenen Reading Readiness 
Test, 301, 302 

Variability, measures of, 523-526 

Verbal abilities, 298-304

Verbal analogies test material, 295-296

Verbal factor, 313 

Verbal tests, 291 

Vineland Social Maturity Scale, 373- 
374, 375 

Vision, 41-42, 265-269

Vocabulary test material, 293-294 

Vocational curricula, intelligence of 
students in, 63 

Vocational guidance, 310 

Vocational interests, 99 




Vocations, nature of American, 314- 
323 


Watch-tick test, 270 

Weight, measurement of, 260 

What Should Our Schools Do?, 466, 
467 

What Would You Do?, 404-405, 406 

Whispered speech test, 270 

White House Conference on Child 
Health and Protection, 46, 106, 109 

Who's Who in America, 229 

Willing Scale for Written Composi- 
tion in Grades IV to VIII, 234 

Window stencils, 505 




Winn Analytic Sewing Scale, 233 
Winnetka Scale for Rating School
Behavior and Attitudes, 372-373, 375

Worker characteristics of jobs, 316-319

Wrightstone Scale of Civic Beliefs,
394, 396, 397, 404, 406, 425


Young-Estabrooks Studiousness Scale, 
419-420 


z, standard error of, 549
z scores, 534-535, 554 


INDEX OF NAMES 


Ackerson, L., 82 

Aikin, W. L., 192

Allport, F. H., 91, 101, 357 

Allport, G. W., 71, 82, 91, 92, 94-95, 
101, 339, 357, 413, 417, 420

Amatora, M., 393 

Anastasi, A., 59, 68 

Anderson, C. J., 508 

Anderson, R. G., 298 

Armacost, D. H., 508

Ayres, L. S., 225-227 


Ballou, F. W., 234, 240 

Bare, T. H., 397, 422 

Barr, A. S., 467, 468, 484 

Baruch, D. W., 104, 115, 427, 428, 
448 

Baumgarten, F., 290 

Beck, R. L., 302, 304 

Beckman, R. O., 434, 448 

Beers, F. S., 24, 37 

Beery, J. R., 397, 422 

Bell, H. M., 320, 334, 359-360, 420 

Bennett, G. K., 357, 386 

Bernreuter, R. G., 340, 356-359, 386, 
420 

Betts, E. A., 301, 302 

Binet, A., 60, 67, 68, 283, 284 

Bingham, W. V., 63, 66, 68, 86, 88, 
101, 290, 309, 324, 333, 334

Bogardus, E. S., 370, 386 

Boucher, C. S., 136, 144

Bowman, E. C., 471, 484

Bowman, K. S., 75, 82 

Boynton, R., 42, 43 

Brandenburg, G. C., 386, 484 

Breemes, E. L., 112, 115 

Breslich, E. R., 305, 306 

Briggs, T. H., 84, 101, 422, 508 

Bronner, A. F., 309 

Brooks, F. D., 231, 240 

Brown, C. M., 21, 35, 129, 144, 220, 
233, 240 


Brown, M., 40, 52 

Brown, W., 204, 205, 420 
Bruel, O., 113, 115 

Bruner, H. B., 407, 447, 448 
Bryan, R. C., 476, 479, 484 
Buck, W., 112, 115 

Bues, H. W., 393
Burgum, M., 106, 116 
Burks, B. S., 68 

Burks, F. W., 51, 52 
Burks, J. D., 51, 52 

Buros, O. K., 200, 212, 307, 309 
Burtt, H. E., 410 


Campbell, D. S., 2 

Canning, L., 419, 422 
Cantril, H., 92, 101 

Capron, V. L., 190, 192 
Carey, G. L., 230, 240 
Carter, H. D., 419, 422 
Carter, R. E., 244-246, 251, 257 
Case, A. T., 407 

Caswell, H. L., 35 

Cattell, J. McK., 229, 240 
Chapin, F. S., 108, 116 
Chapman, J. C., 240, 500, 508 
Chave, E. J., 392, 393, 400, 424 
Clapp, F. L., 508 

Clark, H. F., 453, 484

Clark, J. R., 305, 306 

Clark, W. W., 301, 302 
Cleeton, G. U., 412-413, 417 
Clouse, V. R., 393 

Collins, J. H., 109, 116 

Cook, L. A., 104, 115 

Cook, W. W., 206, 212 
Corey, S. M., 397, 422 
Cowley, W. H., 5, 14 
Cronbach, L. J., 192 
Cunningham, B. V., 292, 293 
Curtis, E. A., 105, 116
Curtis, F. D., 133, 144 




Darley, J., 420, 423 

Davenport, K., 467 

Davis, J., 112-113, 116 

Davis, V., 302, 304 

Dearborn, W. F., 56, 508 
Deaver, G. G., 44 

DeBoer, J. J., 113, 116 

Dewey, J., 91, 101 

Dimmit, M., 393 

Doll, E. A., 375 

Douglass, H. R., 109, 116, 131, 144 
Drake, R. M., 330 

Draper, E. M., 35 

Droba, D. D., 392 

Duffy, E., 414, 423 

Dunlap, J. W., 411-412, 417, 418, 424 
Dykema, P. W., 330 


Eames, T. H., 265, 267-268, 281 
Eckert, R., 404-405 

Ellingson, M., 377, 380, 384, 387 
Elliott, E. C., 129, 145 
Estabrooks, G. H., 419, 425 
Eurich, A. C., 20, 35 


Ferguson, G. A., 202, 212 
Ferson, M. L., 332 
Fitz-Simons, M. J., 429, 448 
Flanagan, J. C., 142, 144, 340, 341,
357, 386, 410, 423, 459, 484 
Flexner, A., 5, 14 
Forlano, G., 387 
Fotos, J. T., 20, 36 
Franzen, R., 47, 48, 49, 52 
Frederick, O. I., 447, 449
Freeman, F. N., 227, 240, 309 
Friedberg, J., 349, 387 
Fryer, D., 101, 409, 410, 423 
Fullerton, G. S., 229, 240 


Gage, N. L., 212 

Gallup, G., 397 

Galton, F., 55, 69 

Gard, P. D., 497, 509 
Garretson, O. K., 408, 412, 417, 423 
Garrett, H. E., 309, 560 
Glaser, E., 415, 417, 423 
Goodykoontz, B., 445, 449 
Gordon, K., 236, 240 
Gray, W. S., 303, 309 
Greene, H. A., 217, 233, 241 


Grice, H. H., 393 

Guilford, J. P., 222-224, 240, 340, 
341, 386 

Guilford, R. B., 340, 341, 386 


Haggerty, M. E., 56, 371, 373, 375

Hall, G. S., 85 

Hall, W., 401, 423 

Hanford, A. C., 135, 144

Harper, M. H., 423 

Harris, A. J., 36 

Hartog, P., 129, 145 

Hartshorne, H., 92, 101, 370, 376,
386 

Hathaway, S. R., 358, 387 

Hawkes, H. E., 36, 145, 192 

Healy, W., 309 

Henmon, V. A. C., 56, 289, 295 

Herbst, R. L., 508 

Herrick, V. E., 92, 101 

Hildreth, G. H., 292, 307, 309 

Hillegas, M. B., 233, 234, 240 

Hinckley, E. D., 392, 397, 423 

Hipskind, M. J., 475, 484

Hoag, E. V., 51, 52 

Holley, C. E., 109, 116 

Hollingworth, H. S., 223-224, 240 

Horne, E. P., 97-98, 101

Hoshaw, L. B., 393 

Hudelson, E., 235, 240, 251 

Hull, C. L., 310, 311, 334 

Hunt, T., 332 

Hutchins, R. M., 5, 14 


Jackson, R. W. B., 202, 212 
James, A. W., 129, 145 
James, W., 85 

Jarvie, L. L., 358, 377, 380, 384, 387 
Jenkins, R. L., 107, 117 
Jensen, M. E., 508 

Jenss, R. M., 259, 281 
Johns, A. A., 358, 387
Johnson, A. P., 421, 423 
Johnson, P. O., 20, 36 
Jones, E. S., 135, 145 
Judd, C. H., 26, 36, 37 
Jung, C. G., 80, 82 


Kandel, I. L., 331, 334 
Karslake, R., 212 
Kasanin, J. N., 82 




Katz, D., 392 
Kefauver, G. N., 69 
Keller, F. J., 290 


Kelley, I. B., 393, 394, 400-401, 423,
461, 464, 484

Kelley, T. L., 58, 145, 207, 212, 286, 
287, 295, 309, 321, 335, 476, 484 

Kelley, V. H., 169, 192 

Kemmel, H., 508 

Kent, G. H., 325, 326 

Kerr, W. A., 108, 110n., 116, 438,
439, 440 

Keys, N., 353 

King, W. I., 449

Klein, A., 263, 281 

Kline, L. W., 230, 231, 240 

Knauber, A. J., 331 

Knox, L. V., 497, 509 

Kopas, J. S., 423 

Kornhauser, A. W., 109, 116 

Krey, A. C., 132-133, 145 

Kroll, A., 484 

Krout, M. H., 107, 116 

Kuder, G. F., 204, 212, 415, 416, 417 

Kuhlmann, F., 298 

Kwalwasser, J., 330 


Laird, D. A., 357 

Leahy, A. M., 439, 449 

Leary, B. E., 22, 23, 28, 29, 30, 36 

Lee, D. M., 301, 302 

Lee, J. M., 143, 144, 149-150, 192, 
301, 302 

Lehman, H. C., 500, 508 

Lentz, T. F., 407 

Lewerenz, A. S., 331, 407 

Likert, R., 376, 387, 397, 423 

Limbert, P. M., 407 

Limp, C. E., 311 

Linden, A. V., 407 

Lindquist, E. F., 36, 137, 145, 206, 212 

Loofbourow, G. C., 353 

Lowe, G. M., 309

Lurie, W. A., 414, 423 

Lynd, H. M., 112, 116 

Lynd, R. S., 112, 116 


McAdory, M., 330-331 
McCarty, S. A., 230, 240 
McConnell, T. R., 20, 21, 36 
McFarland, M. B., 433, 449 


McHale, K., 410 

McMurray, F. I., 53 

McNemar, Q., 289, 294 
MacVaugh, G. S., 497, 509 
Maller, J. B., 351, 386, 413, 417, 423
Mann, C. R., 36 

Mason, H. M., 484 

May, M., 92, 101, 370, 376, 386 
Meier, N. C., 330, 331 

Meltzer, H., 111, 116, 429-430, 449 
Merrill, M. A., 309 

Meyer, G., 131, 145, 250, 257 
Miller, D. B., 228, 240 

Miller, H. E., 393 

Monroe, W. S., 244-246, 251, 257 
Moreno, J. L., 370, 387 

Morgan, C. L., 112, 115 

Moss, F. A., 332 

Mowrer, H., 105, 116 

Murdoch, K., 233 

Murphy, G., 101, 376, 387, 424 
Murphy, L. B., 101, 424 

Myers, C. R., 75, 76, 82

Myers, T. R., 428, 449 


Nelson, E., 98, 101

Nelson, M., 289, 295 

Nemzek, C. L., 105, 116 

Newcomb, T. M., 94, 95-96, 101, 397, 424
Newkirk, L. V., 217, 233, 241 
Noble, M. C. S., 446, 449 


O'Connor, J., 327, 328 

Odbert, H. S., 71, 82, 339 

Odell, C. W., 135, 145, 193, 238, 241, 
251, 257 

Olson, W. C., 371, 373, 375, 387 

Orleans, J. B., 305, 306 

Orleans, J. S., 305, 306

O'Rourke, L. J., 328-329, 410 

Otis, A. S., 218 


Paterson, D. G., 38-39, 53, 333
Perkins, K. J., 461, 464, 484 
Peters, C. C., 19 

Peters, F., 401, 424 

Peters, M. R., 401, 424 

Peterson, B. M., 418, 424 
Peterson, R. A., 360, 387
Peterson, R. C., 392, 393, 400, 424 




Peterson, T. D., 110, 116 

Pintner, R., 56, 292, 293, 298, 387, 509 
Potter, M. A., 305 

Pressey, S. L., 302, 304 


Quackenbush, G. M., 228, 241


Raths, L. E., 28, 36, 181, 184, 193 

Ream, M. J., 410 

Reavis, W. C., 305, 306 

Remmers, H. H., 20, 36, 101, 108,
110, 112, 115, 206, 212, 386, 392-
394, 401, 402, 403, 404, 424, 438,
439, 440, 472, 475, 482, 484, 491,
509

Richardson, M. W., 178, 193, 204, 212, 
416

Rinsland, H. D., 302, 304 

Risen, N. L., 105, 117 

Robertson, CHA., 387 

Rock, R. T., 418, 424 

Rogers, C. R., 354-356, 387

Rogers, J. F., 40, 43, 46, 53, 27% 281 

Rosander, A. C., 393, 407 

Ross, C. C., 497, 509 

Ruch, G. M., 135, 145, 236, 241, 287, 
309

Rugg, H. O., 227-228, 241 

Rundquist, E. A., 111, 117 

Rush, G. P., 240 

Ryans, D. G., 484 


Sageser, H. W., 212 

Saint Clair, W. F., 358, 387 
Samuel, E. A., 347 
Schmalzried, N. T., 481, 484 
Schneck, M. F., 309 
Schneidler, G. G., 333 
Schorling, R., 305, 306 
Schrammel, H. E., 302, 304 
Schultz, R. N., 497, 509 
Seashore, C. E., 329, 330, 331 
Seegers, J. C., 358, 387 

Segel, D., 143, 145, 149-150, 192, 287,
309
Shaffer, L. F., 71, 82 
Shakow, D., 325, 326 
Shartle, C. L., 315, 316, 317, 335 
Sheppard, E. M., 129, 145 
Sheviakov, G., 349, 387
Shimberg, M. E., 309 



Silance, E. B., 393 

Simon, T., 68 

Simpson, M., 429, 449

Sims, V. M., 180, 193, 251, 257,
437-438, 497, 509

Sletto, R. F., 111, 117 

Smith, B. O., 16 

Smith, C. W., 508 

Smith, E. R., 193 

Smith, H. N., 393 

Souther, S. P., 259, 281 

Spearman, C., 59, 62, 67, 69, 204, 205, 
420, 539 

Speer, G. S., 358, 387 

Spencer, D., 338, 342, 343-345, 355

Spranger, E., 98, 101, 413 

Springer, N. N., 110, 117 

Stagner, R., 105-106, 117, 357, 388,
397, 399, 424

Stalnaker, J. M., 169, 193, 254, 257, 
496, 509 

Stalnaker, R. C., 169, 193, 496, 509 

Starch, D., 129, 145 

Stead, W. H., 317, 335 

Steinbeck, J., 445 

Steinmetz, H. C., 407 

Stoddard, G. D., 236, 241, 332 

Stogdill, E. L., 358, 388 

Stott, L. H., 431, 449 

Stouffer, S. A., 400, 424 

Strang, R., 63, 69 

Strong, E. K., 413, 416-421, 424 

Stump, N. F., 509 

Sumner, W. G., 101

Super, D. E., 357, 388 

Sweet, L., 342, 388 

Sydenstricker, E., 436, 449 

Symonds, P. M., 106, 117, 206, 212, 
241, 347, 349, 375, 388, 408, 412, 
417, 424, 429, 449 


Tallmadge, M., 131, 144 

Taylor, K. V. F., 419, 422 

Terman, L. M., 51, 52, 56, 60, 69, 
289, 294, 309, 410

Terry, P. W., 130-131, 145

Theisen, W. W., 235, 241

Thiele, M. B., 393

Thomas, D. M., 393, 401

Thomas, L. C., 263, 281 




Thomas, M. E., 358, 388 

Thompson, C., 78

Thomson, G. H., 59, 69 

Thorndike, E. L., 4-5, 14, 56, 57, 58,
67, 69, 219, 225, 229, 230, 231, 234,
235, 241, 441, 442, 449

Thurstone, L. L., 56, 58, 64, 69, 107,
110, 117, 223, 241, 291, 296, 309,
312, 335, 357, 392-394, 400, 40%,
404, 413, 424, 474

Thurstone, T. G., 296, 357, 393

Tiebout, C., 241

Toops, H. A., 296, 309, 410, 506 

Trabue, M. R., 234, 241 

Traxler, A. E., 78, 80, 83, 377; 37% 
384, 388 

Trimble, O. C., 193

Tschechtelin, M. A., 472, 475, 485

Tussing, L., 420, 424

Tyler, R. W., 5, 14, 19, 21, 26, 27, 36,
37, 147, 173, 180, 236-238, 241 


Van Alstyne, D., 375 

Van Wagenen, M. J., 241, 301, 302 

Veo, L., 82 

Vernon, P. E., 92, 101, 136, 145, 413,
417, 420

Vogel, B. K., 393 




Wang, C. K. A., 393

Ward, R. S., 317 

Ward, W. D., 482, 485 

Warren, H. C., 64 

Weidemann, C. C., 247, 257 
Wesman, A., 418, 424 

West, P. V., 193 

Wheeler, R. H., 57, 69 

Wickman, E. K., 78, 81, 83, 371, 372, 375
Wiley, L. N., 193 
Williamson, A. C., 403, 424 
Williamson, E. G., 5, 14, 53, 333 
Willing, M. H., 234, 241 
Wilson, H., 404-405 
Wissler, C., 410 
Witmer, H. L., 106-107, 117 
Wolfle, D., 335 
Wood, B. D., 20, 26, 37 
Woodrow, H., 56 
Woods, G. G., 133, 144 
Wrightstone, J. W., 394, 397, 404,
406, 425


Yntema, O., 476, 479, 484 
Young, C. W., 419, 425 
Young, K., 101

Young, R. V., 508 
Yourman, J., 83 




